
Friday, March 22, 2019

Ireland Population Density Maps

Data from https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/population-distribution-demography/geostat, which provides a population grid at 1 km² resolution for all of Europe in 2011. The code is a very slightly modified version of the code developed by halhen and p0bs here: https://gist.github.com/halhen/659780120accd82e043986c8b57deae0


Color Maps





Joyplots
Population density is shown as an increase in height.
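For anyone who wants to try it without digging through the gist, here is a minimal sketch of the joyplot approach. It is not halhen's exact code, and the file name and column names (GRD_ID, TOT_P, CNTR_CODE) are assumptions about the Eurostat CSV layout.

library(tidyverse)
library(ggridges)

# Read the 1 km GEOSTAT grid and keep the Irish cells
grid <- read_csv("GEOSTAT_grid_POP_1K_2011.csv") %>%
  filter(CNTR_CODE == "IE") %>%
  # Parse northing/easting (in km) out of cell IDs like "1kmN3443E3210"
  mutate(north = as.numeric(str_extract(GRD_ID, "(?<=N)\\d+")),
         east  = as.numeric(str_extract(GRD_ID, "(?<=E)\\d+")))

# One ridgeline per kilometre of latitude; line height encodes population
ggplot(grid, aes(x = east, y = north, group = north, height = TOT_P)) +
  geom_ridgeline(scale = 0.01, fill = "black", colour = "white") +
  coord_fixed() +
  theme_void()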




Other Countries




Wednesday, March 20, 2019

Heatmap of the world

On the 14th of June 2016 I made a heatmap of global temperatures since 1850. It became popular on Reddit, reaching 15.4K points.


The picture, as I said in the first comment on the post, was inspired by Ed Hawkins' spiral animation of world temperatures and was an attempt to show everything in one picture.
A few outlets picked up the picture. The Weather Channel put it on Facebook without crediting me.

I found out last year that Tableau use it to advertise themselves, with no credit to me as the original designer.

Ed Hawkins tweeted about the graph


And in June 2016 we discussed it further as I switched time to the Y axis and made other changes. My point is that he tweeted about a "'heat-map' representation of global temperature changes since 1850" two years before he published his own heat-map representation (the warming stripes). My graphs used the HadCRUT4 and HadCET datasets.
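For anyone wanting to reproduce the idea, here is a minimal sketch of the year-by-month heatmap. The file name is a placeholder, and I assume a HadCRUT-style monthly file with "YYYY/MM" in the first column and the anomaly in the second (the real file carries extra uncertainty columns, which are dropped here).

library(ggplot2)

had <- read.table("HadCRUT4_monthly.txt", stringsAsFactors = FALSE)[, 1:2]
names(had) <- c("date", "anomaly")
had$year  <- as.integer(substr(had$date, 1, 4))
had$month <- factor(month.abb[as.integer(substr(had$date, 6, 7))],
                    levels = month.abb)

# One tile per month, coloured by temperature anomaly
ggplot(had, aes(x = month, y = year, fill = anomaly)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red") +
  scale_y_reverse() +
  labs(title = "Global temperature anomalies since 1850", fill = "Anomaly (°C)")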

In May 2018 he produced a climate graph (below) with only years, not months, in the heatmap. At no point has he said that my heatmaps (which were inspired by his global temperature spiral animation) might have inspired him.

I had produced a heatmap graph with lines for years, which he saw in September 2017.


Hawkins' graph has deservedly become famous. And heatmaps are one of the most common visualisation types, which is evidence he could have come up with it on his own anyway. I would like it, if my graphs did inspire him, for that to be mentioned somewhere. But the big deal here is that global warming is being shown in a popular, eye-catching way.




Monday, February 25, 2019

February is Hot

At the moment it is really hot. 20°C is ridiculously hot for this time of year.

Central England has weather data going back a long time: daily average temperature from 1772, and daily maximum and minimum temperatures from 1878.

Only twice since 1772 has February the 25th had a daily average temperature above 10°C in Central England: 1790 with 10.7°C, and 1922 with 11°C.

The highest maximum temperatures observed in Central England on the 25th of February were 14.1°C in 1922, and 13.1°C in both 1953 and 1976. Over those 140 years the maximum daily temperature on that date averaged 6.6°C.

I put the code to work all this out here. The graph of February 25th maximum temperatures shows how unusual 20°C is.
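If the daily HadCET file has already been tidied into columns for year, month, day and maximum temperature (the raw Met Office file is a fixed-width year-by-day layout, so that reshaping, and the file name, are assumed here), the checks look something like:

library(dplyr)

cet <- read.csv("cet_daily_max.csv")   # assumed columns: year, month, day, tmax

feb25 <- filter(cet, month == 2, day == 25)

summarise(feb25, mean_tmax = mean(tmax, na.rm = TRUE))  # long-run average max
feb25 %>% arrange(desc(tmax)) %>% slice(1:3)            # the three hottest Feb 25ths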

It takes a while for the HadCET data to be updated. There is a nearby weather station at http://www.weathercast.co.uk/world-weather/weather-stations/obsid/99060.html; when I figure out the relationship between the two, it will be possible to compare today with all the previous years.

Tuesday, February 05, 2019

English soccer is not normal

Are wins in football normally distributed? If they are not it might affect how we should calculate the probabilities of teams winning.

Baseball wins seem not to follow a normal distribution

There is a great R package of football results datasets by James Curley here. This engsoccerdata package has a function to generate soccer league tables for many countries over a long time period.

library(engsoccerdata)
league <- maketable_all(df = england)

                team  GP   W   D   L   gf  ga   gd  Pts Pos
1  Manchester United 962 604 209 149 1856 847 1009 2021   1

and create a new column for the percentage of wins

library(dplyr)
library(ggplot2)

league <- league %>%
    mutate(PercentW = W / GP)

p <- ggplot(data = league, aes(x = PercentW)) +
    geom_histogram()   # binwidth=20
p <- p + ggtitle("Percentage wins\n in English football league") +
    xlab("Percentage Wins") + ylab("Number of Teams")
p <- p + theme_bw() +
    theme(plot.title = element_text(hjust = 0.5))  # centre the title after theme_bw()
       
 

library(fitdistrplus)
library(logspline)

x <- league$PercentW
fit.norm <- fitdist(x, "norm")
plot(fit.norm)
       
 

shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.96276, p-value = 0.0006663

The small p-value means English football wins really do not have a normal distribution.

Goals per game are also not normally distributed. But I don't think anyone expects them to be.

       
league <- league %>%
    mutate(GoalsPgame = gf / GP)
shapiro.test(league$GoalsPgame)

        Shapiro-Wilk normality test

data:  league$GoalsPgame
W = 0.92134, p-value = 4.818e-07

And for France

        Shapiro-Wilk normality test

data:  leagueF$PercentW
W = 0.98522, p-value = 0.4699

So French football wins might have (thanks to Paul for the correction in the comments) a normal distribution. I must check the other leagues in the dataset, as behaviour this different is odd.
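A quick way to check the other leagues, sketched rather than taken from the post: the spain and italy results frames ship with engsoccerdata, and I am assuming maketable_all() accepts them the same way it accepts england.

library(engsoccerdata)

normality_p <- function(results) {
  tab <- maketable_all(df = results)      # all-time league table
  shapiro.test(tab$W / tab$GP)$p.value    # normality p-value for win percentage
}

sapply(list(England = england, Spain = spain, Italy = italy), normality_p)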

Friday, January 13, 2017

Irish Election Spending 2016

In the Irish election 2016 who paid the most for each vote and for each seat?
€8,394,832.89 total spending (report here), with an electorate of 3,305,110, so roughly €2.50 was spent per elector. That is under half what is spent on a US presidential vote.
On a per seat and per vote basis

And on a Per Seat Basis


Party,"Votes,1st pref.",Seats,Spending
Fine Gael,544140,50,2768881.50
Fianna Fáil,519356,44,1687916.29
Sinn Féin,295319,23,650190.38
Labour Party,140898,7,1083718.38
AAA–PBP,84168,6,266942.48
Ind 4 Change,31365,4,51669.18
Social Dem,64094,3,190586.93
Green Party,57999,2,146792.27
and the R code is

library(dplyr)
library(ggplot2)

data  <- read.csv("spending.csv", header = TRUE)
datat <- mutate(data, perV = Spending / Votes.1st.pref., perS = Spending / Seats)

q <- ggplot(data = datat, aes(x = Party, y = perV, fill = Party)) +
    geom_bar(stat = "identity") +
    scale_fill_manual(values = c("#E5E500", "#66BB66", "#6699FF", "#99CC33",
                                 "#FFC0CB", "#CC0000", "#008800", "#752F8B"))
q <- q + theme(axis.text.x = element_text(angle = 90, hjust = 1))
q <- q + theme(legend.position = "none")
q <- q + labs(title = "General Election Spending 2016")
q <- q + labs(y = "Euros Per Vote")

q <- ggplot(data = datat, aes(x = Party, y = perS, fill = Party)) +
    geom_bar(stat = "identity") +
    scale_fill_manual(values = c("#E5E500", "#66BB66", "#6699FF", "#99CC33",
                                 "#FFC0CB", "#CC0000", "#008800", "#752F8B"))
q <- q + theme(axis.text.x = element_text(angle = 90, hjust = 1))
q <- q + theme(legend.position = "none")
q <- q + labs(title = "General Election Spending 2016")
q <- q + labs(y = "Euros Per Seat")




Thursday, May 19, 2016

Dying at Work in the US

The Occupational Safety & Health Administration (OSHA) tracks workplace fatalities in the US, and publicly releases CSV records of the workplace deaths each year.

The data contains the date, location and a description for 4,000 fatalities over five years. I created columns for state, zip code, number of people and cause.

The most common interesting words in these descriptions (counted along the lines of the sketch after these lists) are

  • 813 fell
  • 708 struck
  • 642 truck
  • 452 falling
  • 382 crushed
  • 352 head
  • 263 roof
  • 261 tree
  • 258 electrocuted
  • 244 ladder
  • 238 vehicle
  • 226 trailer
  • 197 machine
  • 186 collapsed
  • 180 forklift

Not common but interesting

  • 10 lightning
  • 48 shot
  • 4 dog
  • 2 bees
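The counting itself can be done along these lines. This is a hedged sketch, since I am assuming a cleaned-up CSV with a description column rather than the raw OSHA export.

library(dplyr)
library(tidytext)

deaths <- read.csv("osha_fatalities.csv", stringsAsFactors = FALSE)

deaths %>%
  unnest_tokens(word, description) %>%    # one row per word
  anti_join(stop_words, by = "word") %>%  # drop "the", "and", ...
  count(word, sort = TRUE) %>%
  head(15)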

and here is a map I made of the states where they happen

I have created a repository to try to augment the OSHA data and clean it up when errors are found.

The repository is on github here.

If you use it I'll give you edit rights and you can help improve it.

Friday, January 22, 2016

England's Temperature in 2015

Nine days in 2015 were the hottest for that day of the year since 1772. This compares to three in 2014, though 2014 had a hotter average temperature and was the hottest year on record in the UK.

England has collected daily temperature data since 1772 in the Hadley Centre Central England Temperature (HadCET) dataset.

I downloaded this Hadley Centre dataset and followed this tutorial, which is based on an original graphic by Tufte.


Here the black line is the average temperature for each day last year. The dark band in the middle is the long-run average daily temperature (with 95% confidence). The straw-coloured larger band represents the highest and lowest average daily temperature ever recorded on that day since 1772. The red dots are the days in 2015 that were hotter than any other day at that time of year since 1772.

Looking at the black line that represents last year's temperatures, it was the winter and autumn that were far above average. Instead of a scorching hot summer, most of the record hot days were in November and December. 2014 had the same pattern of a hot winter. No day in 2015 was the coldest for that date in the recorded period.

Tuesday, November 17, 2015

How Good Will the Upcoming Stephen King Adaptations be?

This is the third and final post in a series on Stephen King adaptations. (The first is Are Stephen King films better than the books?, and the second Do Stephen King's Better Books Make Better Films?)
According to IMDB, the Stephen King novels below are currently being adapted:
Title            good doubled
The Talisman     8.2
Rose Madder      7.22
Lisey's Story    7.22
Mr. Mercedes     7.74
Gerald's Game    6.86
Cell             7.22
11.22.63         8.52
The Dark Tower   8
Three of these books have exactly the same rating on Goodreads. I will use the correlation between the ratings of his books and his films, discussed previously, to predict the ratings of the upcoming movies. The analysis gives 95% prediction intervals of:
Title            estimate   lower bound   upper bound
The Talisman     7.5        6.4           8.6
Rose Madder      5.95       4.8           7.0
Lisey's Story    5.95       4.8           7.0
Mr. Mercedes     6.79       5.7           8.8
Gerald's Game    5.37       4.2           6.4
Cell             5.95       4.8           7.0
11.22.63         8.05       6.9           9.15
The Dark Tower   7.2        6.1           8.2
The upcoming films, with their predicted IMDB ratings, are in red in the graph below. Three are predicted to have a rating of 5.95.

There is an interesting 538 podcast here about a company that predicts film earnings. The guest mentions the correlation between film quality and earnings. This whole topic is an interesting challenge for prediction.

Stephen King is unique among authors in the number and variety of adaptations his works have gone through, so he is possibly the only author this could even be tried with. I am really looking forward to 11.22.63 and The Talisman now. And if Gerald's Game beats its predicted IMDB score, that is a bonus.




Code


mydata = read.csv("King.csv")
library(ggplot2)
attach(mydata)     # attach the data frame
king.lm = lm(imdb ~ good.doubled)
king.lm

Call:
lm(formula = imdb ~ good.doubled)

Coefficients:
 (Intercept)  good.doubled
      -5.821         1.628

upcoming = read.csv("upcoming.csv")
predict(king.lm, upcoming, interval="predict")

King.csv, with the upcoming books appended (clr = 2), contains:

Title,Publication date,Pages,imdb,goodreads,good doubled,clr
Carrie,05/04/1974,199,7.4,3.89,7.78,1
Salem's Lot,17/10/1975,439,6.8,3.97,7.94,1
The Shining,28/01/1977,447,8.4,4.12,8.24,1
The Stand,Sep-78,823,7.3,4.32,8.64,1
The Dead Zone,Aug-79,428,7.3,3.88,7.76,1
Firestarter,29/09/1980,426,6,3.8,7.6,1
Cujo,08/09/1981,319,6,3.61,7.22,1
The Running Man,May-82,219,6.6,3.74,7.48,1
Christine,29/04/1983,526,6.6,3.69,7.38,1
Pet Sematary,14/11/1983,374,6.6,3.86,7.72,1
Thinner,19/11/1984,309,5.7,3.6,7.2,1
It,15/09/1986,1138,6.9,4.12,8.24,1
Misery,08/06/1987,310,7.8,4.06,8.12,1
The Tommyknockers,10/11/1987,558,5.4,3.42,6.84,1
The Dark Half,20/10/1989,431,5.9,3.71,7.42,1
Needful Things,Oct-91,690,6.2,3.84,7.68,1
Dolores Claiborne,Nov-92,305,7.4,3.76,7.52,1
The Green Mile,March–August 1996,400,8.5,4.39,8.78,1
Bag of Bones,22/09/1998,529,5.8,3.84,7.68,1
Dreamcatcher,20/03/2001,620,5.5,3.53,7.06,1
Under the Dome,10/11/2009,1074,6.8,3.89,7.78,1
Shawshank Redemption,10/11/2009,181,9.3,4.51,9.02,1
Stand by me,10/11/2009,80,8.1,4.25,8.5,1
The Mist,10/11/2009,230,7.2,3.88,7.76,1
The Langoliers,10/11/2009,230,6.1,3.71,7.42,1
Apt Pupil,1983,179,6.7,3.8,7.7,1
Hearts in Atlantis,2000,640,6.9,3.77,7.54,1
The Talisman,na,na,7.5,4.1, 8.2,2
Rose Madder,na,na,5.95,3.61, 7.22,2
Lisey's Story,na,na,5.95,3.61, 7.22,2
Mr. Mercedes,na,na,6.79,3.87, 7.74,2
Gerald's Game,na,na,5.37,3.43, 6.86,2
Cell,na,na,5.95,3.61, 7.22,2
11.22.63,na,na,8.05,4.26, 8.52,2
The Dark Tower,na,na,7.2,4,8,2


mydata = read.csv("King.csv")
attach(mydata)
p1 <- ggplot(mydata, aes(x=good.doubled, y=imdb)) +
    geom_point(colour = c("blue3", "red3")[mydata$clr], shape=1, size=2) +  # hollow circles, red for upcoming films
    geom_smooth(method=lm, se=FALSE)
p1 <- p1 + ylab("IMDB Ratings")  
p1 <- p1 + xlab("GoodReads Ratings")  
p1 <- p1 + ggtitle("Upcoming Stephen King Adaptations")
p1 <- p1 + annotate("text", x = 6.97, y = 5.26, label = "The Tommyknockers", size=3, colour="blue3")  
p1 <- p1 + annotate("text", x = 7.85, y = 7.42, label = "Carrie", size=3, colour="blue3") 
#Salem's Lot,17/10/1975,439,6.8,3.97,7.94,1
p1 <- p1 + annotate("text", x =8.0 , y =6.9, label = "Salem's Lot", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 7.8, y = 6.5, label = "Pet Sematary", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8.34, y = 8.51, label = "The Shining", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8.62, y = 8.2, label = "Stand By Me", size=3, colour="blue")
p1 <- p1 + annotate("text", x = 8.9, y = 8.4, label = "The Green Mile", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 9.0, y = 9.09, label = "Shawshank\nRedemption" , size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8.75, y = 7.3, label = "The Stand", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8.27, y = 6.85 , label = "It", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8.19, y = 7.74, label = "Misery", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8.05, y = 6.7, label = "Under the Dome", size=3, colour="blue3") 
#Under the Dome,10/11/2009,1074,6.8,3.89,7.78,1
p1 <- p1 + annotate("segment", x = 7.9, xend = 7.79, y = 6.7, yend = 6.8, colour = "blue3") 
#Dolores Claiborne,Nov-92,305,7.4,3.76,7.52,1
p1 <- p1 + annotate("text", x = 7.5, y = 7.3, label = "Dolores Claiborne", size=3, colour="blue3") 
#The Dark Half,20/10/1989,431,5.9,3.71,7.42,1
p1 <- p1 + annotate("text", x = 7.5, y = 5.81, label = "The Dark Half", size=3, colour="blue3")
#Bag of Bones,22/09/1998,529,5.8,3.84,7.68,1
p1 <- p1 + annotate("text", x = 7.78, y = 5.7, label = "Bag of Bones", size=3, colour="blue3")
#The Dead Zone,Aug-79,428,7.3,3.88,7.76,1
p1 <- p1 + annotate("text", x = 7.5, y = 7.72, label = "The Dead Zone", size=3, colour="blue3")
p1 <- p1 + annotate("segment", x = 7.76, xend = 7.5, y = 7.33, yend = 7.66, colour = "blue3") 
#The Mist,10/11/2009,230,7.2,3.88,7.76,1
p1 <- p1 + annotate("text", x = 7.76, y = 7.1, label = "The Mist", size=3, colour="blue3")
#Firestarter,29/09/1980,426,6,3.8,7.6,1
p1 <- p1 + annotate("text", x = 7.71, y = 6, label = "Firestarter", size=3, colour="blue3")
#The Langoliers,10/11/2009,230,6.1,3.71,7.42,1
p1 <- p1 + annotate("text", x = 7.56, y = 6.11, label = "The Langoliers", size=3, colour="blue3")
#Cujo,08/09/1981,319,6,3.61,7.22,1
p1 <- p1 + annotate("text", x = 7.23, y = 6.1, label = "Cujo", size=3, colour="blue3")
#The Running Man,May-82,219,6.6,3.74,7.48,1
p1 <- p1 + annotate("text", x = 7.48, y = 6.7, label = "The Running Man", size=3, colour="blue3")
#Christine,29/04/1983,526,6.6,3.69,7.38,1
p1 <- p1 + annotate("text", x = 7.38, y = 6.5, label = "Christine", size=3, colour="blue3")
#Thinner,19/11/1984,309,5.7,3.6,7.2,1
p1 <- p1 + annotate("text", x = 7.25, y = 5.6, label = "Thinner", size=3, colour="blue3")
#Needful Things,Oct-91,690,6.2,3.84,7.68,1
p1 <- p1 + annotate("text", x = 7.83, y = 6.2, label = "Needful Things", size=3, colour="blue3")
#Dreamcatcher,20/03/2001,620,5.5,3.53,7.06,1
p1 <- p1 + annotate("text", x = 7.2, y = 5.5, label = "Dreamcatcher", size=3, colour="blue3")
#The Talisman,na,na,7.52,4.1, 8.2,2
p1 <- p1 + annotate("text", x = 8.36, y = 7.52, label = "The Talisman", size=3, colour="red3")
#Rose Madder,na,na,5.93,3.61, 7.22,2
p1 <- p1 + annotate("text", x = 7.07, y = 5.93, label = "Rose Madder", size=3, colour="red3")
#Lisey's Story,na,na,5.93,3.61, 7.22,2
p1 <- p1 + annotate("text", x = 7.07, y = 6.05, label = "Lisey's Story", size=3, colour="red3")
#Mr. Mercedes,na,na,6.78,3.87, 7.74,2
p1 <- p1 + annotate("text", x = 7.60, y = 6.81, label = "Mr. Mercedes", size=3, colour="red3")
#Gerald's Game,na,na,5.34,3.43, 6.86,2
p1 <- p1 + annotate("text", x = 7.03, y = 5.36, label = "Gerald's Game", size=3, colour="red3")
#Cell,na,na,5.93,3.61, 7.22,2
p1 <- p1 + annotate("text", x = 7.28, y = 5.92, label = "Cell", size=3, colour="red3")
#11.22.63,na,na,8.05,4.26, 8.52,2
p1 <- p1 + annotate("text", x = 8.63, y = 8.05, label = "11.22.63", size=3, colour="red3")
#The Dark Tower,na,na,7.2,4,8,2
p1 <- p1 + annotate("text", x = 8.16, y = 7.2, label = "The Dark Tower", size=3, colour="red3")
p1 <- p1 + ylab("IMDB Ratings")  
p1 <- p1 + xlab("GoodReads Ratings")
p1
ggsave("plot.png", width=10, height=10, dpi=100)
detach(mydata)

Friday, November 13, 2015

Do Stephen King's Better Books Make Better Films?

My last post (and data) got a bit popular on reddit. Some people noticed that Stephen King films ratings and the books ratings seemed highly correlated.

I think Max is right on this. So I made a graph showing how movie and book ratings are correlated

Now the actual correlation figure is


cor(good.doubled, imdb, use="complete")
[1] 0.8766

So it looks like highly rated King books make highly rated films.

It could be that a good film makes people read, and rate highly, a book. But my basic conclusion is that Stephen King's highly rated books make higher rated films.


Appendix: Code for the Graph


library(ggplot2)

mydata = read.csv("King.csv")
attach(mydata)
cor(good.doubled, imdb, use="complete")
p1 <- ggplot(mydata, aes(x=good.doubled, y=imdb)) +
    geom_point(shape=1) +    # Use hollow circles
    geom_smooth(method=lm,   # Add linear regression line
                se=FALSE)    # Don't add shaded confidence region
p1 <- p1 + ylab("IMDB Ratings")
p1 <- p1 + xlab("GoodReads Ratings *2")
p1 <- p1 + ggtitle("Stephen King: Books vs. Movies Correlation")
p1 <- p1 + geom_text(aes(label=ifelse(good.doubled>0, as.character(Title), '')),
                     hjust=0, vjust=0, size=2, position="jitter")
p1
ggsave("correlate.png")





Other Correlations: I changed the publication date column into a column of years since 1974.

cor(good.doubled, Years, use="complete")
[1] -0.1879052

Not a strong correlation: King's adapted books have not clearly gotten better or worse since he started.

> cor(good.doubled, Pages, use="complete")
[1] -0.02715659

No relationship between the length of a book and it being rated highly or lowly.

> cor(Years, Pages, use="complete")
[1] 0.482048

King's adapted books may be getting a bit longer over time.

Wednesday, November 11, 2015

Are Stephen King films better than the books?

Rejoice, Snobs: The Book IS Better Than The Movie
"The literary originals have higher ratings than the film adaptations 74 percent of the time."
This recent piece claims books are usually better than their film versions.

This article reminded me of a claim that Stephen King films are better than his books. So I extracted the ratings from IMDB and Goodreads. I doubled the Goodreads ratings to make them out of 10.


I included miniseries. Carrie was made twice but I only included the original. The Langoliers, Maximum Overdrive, The Lawnmower Man and Secret Garden seem not to have independent English-language printed book versions.

 

The Shining is one of two films with a higher rating than the book. King famously hates the film

"Are you mystified by the cult that's grown around Kubrick's Shining?
I don't get it. But there are a lot of things that I don't get. But obviously people absolutely love it, and they don't understand why I don't. The book is hot, and the movie is cold; the book ends in fire, and the movie in ice."

 

To make this a fair test, films and books would have to be graded on a curve, by the average rating of each medium. There is a good discussion on how to normalise the original comparison here. But I think this graph is enough to show Stephen King's books are better than his films.

Less than 8% of King adaptations are rated higher than the book, compared to the usual 26% for adaptations in general. For King to have the average quality of adaptations, roughly six more of his films would need to be rated better than their books. This would require Dolores Claiborne, The Green Mile, Stand By Me, Misery and maybe Carrie to be much worse books, or these already highly rated films to be much better. I am going to call this Myth Busted*: Stephen King adaptations are less successful than the average adaptation.

 

Title                 Publication date      Pages  imdb  goodreads  good doubled
Carrie                April 5, 1974         199    7.4   3.89       7.78
'Salem's Lot          October 17, 1975      439    6.8   3.97       7.94
The Shining           January 28, 1977      447    8.4   4.12       8.24
The Stand             Sep-78                823    7.3   4.32       8.64
The Dead Zone         Aug-79                428    7.3   3.88       7.76
Firestarter           September 29, 1980    426    6     3.8        7.6
Cujo                  September 8, 1981     319    6     3.61       7.22
The Running Man       May-82                219    6.6   3.74       7.48
Christine             April 29, 1983        526    6.6   3.69       7.38
Pet Sematary          November 14, 1983     374    6.6   3.86       7.72
Thinner               November 19, 1984     309    5.7   3.6        7.2
It                    September 15, 1986    1138   6.9   4.12       8.24
Misery                June 8, 1987          310    7.8   4.06       8.12
The Tommyknockers     November 10, 1987     558    5.4   3.42       6.84
The Dark Half         October 20, 1989      431    5.9   3.71       7.42
Needful Things        Oct-91                690    6.2   3.84       7.68
Dolores Claiborne     Nov-92                305    7.4   3.76       7.52
The Green Mile        March–August 1996     400    8.5   4.39       8.78
Bag of Bones          September 22, 1998    529    5.8   3.84       7.68
Dreamcatcher          March 20, 2001        620    5.5   3.53       7.06
Under the Dome        November 10, 2009     1074   6.8   3.89       7.78
Shawshank Redemption  1982                  181    9.3   4.51       9.02
Stand by me           1982                  80     8.1   4.25       8.5
The Mist              1983                  230    7.2   3.88       7.76
Apt Pupil             1983                  179    6.7   3.8        7.7
Hearts in Atlantis    2000                  640    6.9   3.77       7.54

   
mydata = read.csv("King.csv")
library(ggplot2)
p1 <- ggplot(mydata, aes(x = good.doubled, y = imdb))
p1 <- p1 + geom_abline(intercept=0, slope=1)
p1 <- p1 + geom_point(shape=1)
p1 <- p1 + ylim(5, 10)
p1 <- p1 + xlim(5, 10)

p1 <- p1 + ylab("IMDB Ratings")
p1 <- p1 + xlab("GoodReads Ratings *2")
p1 <- p1 + ggtitle("Stephen King: Books vs. Movies")

p1 <- p1 + annotate("text", x = 7.3, y = 5.3, label = "The Tommyknockers", size=3, colour="blue3")  
p1 <- p1 + annotate("text", x = 7.95, y = 7.37, label = "Carrie", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8, y = 6.5, label = "Pet Sematary", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 7.9, y = 8.54, label = "The Shining", size=3, colour="red3") 
p1 <- p1 + annotate("text", x = 8.75, y = 9.4, label = "Shawshank Redemption", size=3, colour="red3")
p1 <- p1 + annotate("text", x = 9.18, y = 8.4, label = "The Green Mile", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8.72, y = 8, label = "Stand By Me", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8.92, y = 7.3, label = "The Stand", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8.3, y = 6.85 , label = "It", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8.32, y = 7.7, label = "Misery", size=3, colour="blue3") 
p1 <- p1 + annotate("text", x = 8.18, y = 6.7, label = "Under the Dome", size=3, colour="blue3") 

p1 <- p1 + annotate("text", x = 9.28, y = 5.0, label = "Books Higher Rated", size=5, colour="blue3") 
p1 <- p1 + annotate("text", x = 5.7, y = 9.90, label = "Films Higher Rated", size=5, colour="red3") 



p1 <- p1 + annotate("text", x = 6.9, y = 6, label = "Cujo", size=3, colour="blue3") 
p1 <- p1 + annotate("segment", x = 7.05, xend = 7.2, y = 6.0, yend = 6.0, colour = "blue3") 


p1 <- p1 + annotate("text", x = 6.45, y = 5.52, label = "Dreamcatcher", size=3, colour="blue3") 
p1 <- p1 + annotate("segment", x = 6.8, xend = 7.05, y = 5.5, yend = 5.5, colour = "blue3") 


p1 <- p1 + annotate("text", x = 6.80, y = 7.4, label = "Dolores Claiborne", size=3, colour="blue3")
p1 <- p1 + annotate("segment", x = 7.28, xend = 7.5, y = 7.4, yend = 7.4, colour = "blue3") 

p1 <- p1 + annotate("text", x = 7.92, y = 5.72, label = "The Dark Half", size=3, colour="blue3")
p1
ggsave("plot.png")
*This needs proper statistical significance testing, so not really busted.

Sunday, April 19, 2015

England's Temperature in 2014

2014 was the UK's hottest year on record, says the Met Office. What does that mean on a day-to-day basis? England has collected daily temperature data since 1772 in the Hadley Centre Central England Temperature (HadCET) dataset.

According to Wikipedia, last year was the hottest year in this dataset with an average temperature of 10.94°C.

I downloaded this Hadley Centre dataset and followed this tutorial, which is based on an original graphic by Tufte. The picture is similar to the one from my post on Dublin's 2014 weather.


Looking at the black line that represents last year's temperatures, it was the winter and autumn that were far above average. Instead of a scorching hot summer, the three record hot days were in October and December. No day in 2014 was the coldest for that date in the recorded period.

The Wikipedia page on this dataset shows its average temperature rising over time.

Saturday, February 28, 2015

What Colour are Books?

What colour are famous books?

Colours Used: I counted up the occurrences of the colours
colours = ["red","orange","yellow","green","blue","purple","pink","brown","black","gray","white", "grey"]
in Ulysses by James Joyce. I'll post the word-count code soon.

red 113, orange 12, yellow 50, green 98, blue 82, purple 17, pink 21, brown 59, black 146, gray 2, white 163, grey 68
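In the meantime, a minimal sketch of the count looks like this, assuming a plain-text ulysses.txt (e.g. from Project Gutenberg). Note it counts whole-word occurrences only, so compounds like "snotgreen" are missed.

colours <- c("red", "orange", "yellow", "green", "blue", "purple",
             "pink", "brown", "black", "gray", "white", "grey")

text  <- tolower(readLines("ulysses.txt", warn = FALSE))
words <- unlist(strsplit(text, "[^a-z]+"))  # crude tokeniser

table(factor(words, levels = colours))      # occurrences of each colour word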

I turned this count into a bar chart with the ggplot2 graphing package.

library(ggplot2)
df <- data.frame(colours = factor(c("pink","red","orange","yellow","green","blue","purple","brown","black","white","grey"),
                                  levels=c("pink","red","orange","yellow","green","blue","purple","brown","black","white","grey")),
                 total_counts = c(21.0, 113.0, 12.0, 50.0, 98.0, 82.0, 17.0, 59.0, 146.0, 163.0, 70.0))
colrs <- factor(c("pink","red","orange","yellow","green","blue","purple","brown","black","white","grey"))

bp <- ggplot(data=df, aes(x=colours, y=total_counts)) +
    geom_bar(stat="identity", fill=colrs) + guides(fill=FALSE)
bp <- bp + theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
    ggtitle("Ulysses Color Counts")
bp

There is a huge element of unweaving the rainbow in just counting the times a colour is mentioned in a book. The program distills “The sea, the snotgreen sea, the scrotumtightening sea.” into a single number. Still I think the ability to quickly look at the colour palette of a book is interesting.

The same picture made from the colours in Anna Karenina by Leo Tolstoy, Translated by Constance Garnett


Translations
Translations produce really funny graphs with this method. According to Jenks@GreekMythComix, the ancient Greeks did not really use colours in the same abstract way we do. Things were not 'orange' so much as 'the colour of an orange'. The counts in the Alexander Pope translation of the Iliad are
red 36, yellow 11, green 16, blue 9, purple 43, brown 4, black 69, gray 1, white 25, grey 6

Because colours are not really mentioned in the original Iliad, these sorts of graphs could be a quick way to compare translations. Google Books trends does not seem to show increased use of these colours over time.

Sunday, February 22, 2015

2014 Weather Visualizations

There is a great tutorial by Brad Boehmke here on how to build a visualization of one year's temperature compared to a historical dataset. The infographic is based on one by Tufte.

Met Éireann have datasets going back to 1985 on their website here. After some basic data munging on the Met Éireann set for Dublin Airport, I followed the R code from the tutorial above to build the graphs below. Wexford would be more interesting for sun, and Kerry for rain and wind, but those datasets would not download for me.

The first is a comparison of the temperature in 2014 compared to the same date in other years.

Next I looked at average wind speed

And finally the number of hours of sun

From these visualizations it doesn't look like 2014 was a particularly unusual year for Irish weather. With 30 years of past data, if weather were random (which it isn't), around 12 days would break the high or low mark for each measure, since each date has roughly a 1-in-30 chance of setting a new record (365/30 ≈ 12). Only the number of sunny days beat this metric. The data met.ie gives contains every day since 1985:

maxtp: - Maximum Air Temperature (C)

mintp: - Minimum Air Temperature (C)

rain: - Precipitation Amount (mm)

wdsp: - Mean Wind Speed (knot)

hm: - Highest ten minute mean wind speed (knot)

ddhm: - Mean Wind Direction over 10 minutes at time of highest 10 minute mean (degree)

hg: - Highest Gust (knot)

sun: - Sunshine duration (hours)

dos: - Depth of Snow (cm)

g_rad - Global Radiation (j/cm sq.)

i: - Indicator

Gust might be an interesting one given the storms we had in winter 2014. I put big versions of these pictures here, here and here.

Wednesday, February 04, 2015

Irish Alcohol Consumption in 2020

Drink blitz sees bottle of wine rise to €9 minimum 'Irish people still drink an annual 11.6 litres of pure alcohol per capita, 20pc lower than at the turn of the last decade. The aim is to bring down Ireland's consumption of alcohol to the OECD average of 9.1 litres in five years' time.'

What would Irish alcohol consumption be if current trends continue? Knowing this the effectiveness of new measures can be estimated.

The OECD figures are here; I put them in a .csv here. The WHO figures for alcohol consumption are here. I loaded the data in R:

datavar <- read.csv("OECDAlco.csv")
attach(datavar)
plot(Date, Value,
     main="Ireland Alcohol Consumption")

Which looks like this:

Looking at that graph, alcohol consumption rose from 1960 (the first year we have data for) until about 2000, and then started dropping. So if the trend since 2000 continued, what would alcohol consumption be in 2020?

'Irish people still drink an annual 11.6 litres.' I would like to see the source for this figure. We drank 11.6 litres in 2012 according to the OECD; I cannot find OECD figures for 2014. In 2004 we drank 13.6 L, and the claimed 20pc reduction of this is 10.9 L, not 11.6 L. Whereas the 14.3 L we drank in 2002, with a 20pc reduction, would now be 11.4 L. So it really looks to me like the Independent were measuring alcohol usage up to 2012.

Taking the data from 2000 until 2012:

newdata <- datavar[ which(datavar$Date > 1999), ]
detach(datavar)
attach(newdata)
plot(Date, Value,
     main="Ireland Alcohol Consumption")
cor(Date, Value)

The correlation between year and alcohol consumption since 2000 is -0.9274126, so there is a close relationship between the year and the amount of alcohol consumed over that period. Picking 2000, near the peak of alcohol consumption, as the starting date for the analysis is arguable. But 2002 was the start of this visible trend in reduced alcohol consumption.

Now I ran a linear regression to predict based on this data alcohol consumption in 2015 and 2020.

> linearModelVar <- lm(Value ~ Date, newdata)
> linearModelVar$coefficients[[2]]*2015+linearModelVar$coefficients[[1]]
[1] 10.42143
> linearModelVar$coefficients[[2]]*2020+linearModelVar$coefficients[[1]]
[1] 9.023077
> 
This means that, based on data from 2000-2012, we would expect people to drink 10.4 litres this year, falling to 9 litres in 2020. So on current trends Irish alcohol consumption will already be lower than 'the aim is to bring down Ireland's consumption of alcohol to the OECD average of 9.1 litres in five years'.
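The same numbers drop out of predict() directly, rather than multiplying the coefficients out by hand:

predict(linearModelVar, newdata = data.frame(Date = c(2015, 2020)))
#  10.42143  9.023077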

There could be something else that will alter the trend. One obvious candidate is a glut of young adults. People in their 20s drink more than older people, so if there is a higher proportion of young people about then alcohol consumption will rise, all else being equal. So will there be a higher proportion of people in their 20s in five years' time?

The population pyramid projections for Ireland are here. Looking at these, there seems to have been a higher proportion of young adults in 2010 than there will be in 2020, which would imply lower alcohol consumption.

It would be interesting to see the data and the model that the predictions of Irish alcohol consumption are based on, and to see how minimum alcohol pricing changes the results of those models. But without seeing them, it looks like the Government strategy is promising that current trends will continue in response to a new law.

Sunday, December 30, 2012

World Cup 2010 Heatmap

I am reading Visualize This by Yau at the moment. It is full of really pretty visualization ideas and examples. One it has is creating a heatmap of NBA players. To practice this visualization I have made one of World Cup 2010 players. I got the dataset from the Guardian Data Blog, 'World Cup 2010 statistics: every match and every player in data'. The data only has five quantified qualities, but that is good enough to practice making heatmaps.
The R code I used is below:
library(RColorBrewer)
#save the guardian data to world.csv and load it
players2<-read.csv('World.csv', sep=',', header=TRUE)
players2[1:3,]
#players with the same name (like Torres) meant I had to merge surnames and countries
players2$Name <-paste(players2[,1], players2[,2])
rownames(players2) <- players2$Name
###I removed one player by hand
###I now do not need these columns
players2$Position <- NULL
players2$Player.Surname <- NULL
players2$Team <- NULL
players3 <-players2[order(players2$Total.Passes, decreasing=TRUE),]
### or to order by time played
###players3 <-players2[order(players2$Time.Played, decreasing=TRUE),]
players3 <- players3[,1:5]
players4<-players3[1:50,]
players_matrix <-data.matrix(players4)
###change names of columns to make graph readable
colnames(players_matrix )[1] <- "played"
colnames(players_matrix )[2] <- "shots"
colnames(players_matrix )[3] <- "passes"
colnames(players_matrix )[4] <- "tackles"
colnames(players_matrix )[5] <- "saves"
players_heatmap <- heatmap(players_matrix, Rowv=NA, Colv=NA, col = brewer.pal(9, 'Blues'), scale='column', margins=c(5,10), main="World Cup 2010")
dev.print(file="SoccerPassed.jpeg", device=jpeg, width=600)       
#players_heatmap <- heatmap(players_matrix, Rowv=NA, Colv=NA, col = brewer.pal(9, 'Greens'), scale='column', margins=c(5,10), main="World Cup 2010")
#dev.print(file="SoccerPlayed.jpeg", device=jpeg, width=600) 
dev.off()
Nothing very fancy here. Just showing that with a good data source and some online tutorials it is easy enough to knock up a picture in a fairly short time.

Monday, December 24, 2012

The Price Of Guinness

When money's tight and hard to get 
And your horse has also ran, 
When all you have is a heap of debt - 
A PINT OF PLAIN IS YOUR ONLY MAN.
Myles Na Gopaleen

How much has Guinness increased in price over time? Below is a graph of the price changes. The data is taken from a combination of the Guinness price index and CSO data.

The R code for this graph is below.

pint<-read.csv('pintindex.csv', sep=',', header=TRUE)
plot(pint$Year, pint$Euros, type="s", main="Price Pint of Guinness in Euros", xlab="Year", ylab="Price Euros", bty="n", lwd=2)
dev.print(file="Guinness.jpeg", device=jpeg, width=600)       
dev.off() 
Paul in the comments asked a good question: how does this compare to earnings?
 Year   Price (€)   Earnings/Price (pints per week)   Earnings per Week (€)
 2008   4.22        167.31                            706.03
 2009   4.34        161.69                            701.73
 2010   4.2         165.02                            693.08
 2011   4.15        165.81                            688.11
 2012   4.23        163.56                            691.87
Here the earnings are average weekly earnings, the modern and slightly different measure compared to the average industrial wage that the Pint Index used. It shows that even with a price drop in Guinness, the purchasing power of wages measured in pints decreased. This is based on gross wages; increases in tax probably made the situation measured in net wages worse.

Pintindex.csv is:

Year, Euros
1969, 0.2
1973, 0.24
1976, 0.48
1979, 0.7
1983, 1.37
1984, 1.48
1985, 1.52
1986, 1.64
1987, 1.73
1988, 1.8
1989, 1.87
1990, 1.93
1991, 2.02
1992, 2.15
1993, 2.24
1994, 2.34
1995, 2.42
1996, 2.5
1997, 2.52
1998, 2.65
1999, 2.74
2000, 2.88
2001, 3.01
2002, 3.24
2003, 3.41
2004, 3.54
2005, 3.63
2006, 3.74
2007, 4.03
2008, 4.22
2009, 4.34
2010, 4.2
2011, 4.15
2012, 4.23

Wednesday, December 19, 2012

Cystic Fibrosis Improved Screening

In the first post I claimed that, like Tay-Sachs in Israel, Cystic Fibrosis could be drastically reduced with some relatively inexpensive genetic testing. In the second, further analysis suggested that such genetic screening of the Irish population would pay for itself several times over. In this post I want to see if some form of targeted screening could be shown to be as cost effective as currently implemented screening.

Currently there is free screening for people who have relatives with CF, and their partners. I assume they include second cousins as relatives. Based on this paper and some consanguinity calculations, I calculate that an Irish couple where one partner's second cousin has CF have about twice the chance of having a child with CF as the general population. This means you can currently be tested for free if you have about a 1 in 700 chance of having a child with cystic fibrosis, whereas the general population has a 1 in 1444 chance. If a test can be focused so that it is twice as good as random screening, that should be enough by current standards to be rolled out.
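For what it's worth, the 1 in 1444 figure drops out of the commonly cited ~1 in 19 CF carrier rate in Ireland (this derivation is mine, not from the post's sources):

p_carrier  <- 1 / 19                  # approximate Irish CF carrier rate
p_child_cf <- p_carrier^2 * (1 / 4)   # both parents carriers, 1-in-4 inheritance
1 / p_child_cf                        # = 1444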

How could a non-random screening be made this focused?

1. Geographic area. Some areas of the country might be more likely to have CF carriers than others, so targeting screening in those areas might make it twice as effective. The Cystic Fibrosis Registry of Ireland annual report 2010 gives numbers for Irish counties. Four counties do not have their numbers listed, but I have estimated these based on their population.

This map is based on the figures of people with CF found in the registry. This could be a biased sample, or people could have moved. A better measure would be babies born with CF in each county.

Number of people with CF in each county might be useful for deciding how to allocate some treatment resources. What % of people have CF is more interesting for screening though. To work this out we first need the numbers found in each county.

The number of people with CF in the registry per ten thousand people is

I can send anyone who wants them full-sized versions of these maps, or the R code I used to generate them. The code I used is below.

library(RColorBrewer)
library(sp)
con <- url("http://gadm.org/data/rda/IRL_adm1.RData")
load(con)   # load the `gadm` spatial object used by spplot below
close(con)
people<-read.csv('cases.csv', sep=',', header=TRUE)
pops = cut(people$cases,breaks=c(0,2,10,20,30,40,50,70,150,300))
myPalette<-brewer.pal(9,"Purples")
spplot(gadm, "pops", col.regions=myPalette, main="Cystic Fibrosis Cases Per County",
       lwd=.4, col="black")
dev.print(file="CFIrl.jpeg", device=jpeg, width=600)
dev.off()
population<-read.csv('countypopths.csv', sep=',', header=TRUE)
pops = cut(population$population,breaks=c(0,20,40,60,70,80,100,160,400,1300))

myPalette<-brewer.pal(9,"Greens")
spplot(gadm, "pops", col.regions=myPalette, main="Population in thousands",
       lwd=.4, col="black")
dev.print(file="PopIrl.jpeg", device=jpeg, width=600)       
dev.off()

gadm$cfpop <- people$cases/(population$population/10)
cfpop = cut(gadm$cfpop,breaks=c(0,0.5,1,1.5,2,2.5,3,3.5))
gadm$cfpop <- as.factor(cfpop)

myPalette<-brewer.pal(7,"Blues")
spplot(gadm, "cfpop", col.regions=myPalette, main="CF/Population Irish Counties",
       lwd=.4, col="black")
dev.print(file="CFperPopIrl.jpeg", device=jpeg, width=600)       
dev.off() 
If this result were replicated in a more complete analysis, just picking the darker counties could get you the two-times amplification needed to make a test as strong as the currently funded ones.

2. Pick certain ethnic minorities. Some groups have higher levels of CF than the average population. For example, Travellers have higher levels of some disorders: 'disorders, including Phenylketonuria and Cystic fibrosis, that are found in virtually all Irish communities and probably are no more common among Travellers than in the general Irish population. The second are disorders, including Galactosaemia, Glutaric Acidaemia Type I, Hurler's Syndrome, Fanconi's Anaemia and Type II/III Osteogenesis Imperfecta, that are found at much higher frequencies in the Traveller community than the general Irish population.' Also: 'There is no proactive screening of the Traveller population no more than there is proactive screening of the non-traveller Irish population.' I do not think deliberate screening of one ethnic group, unless that group itself organises it, is a good idea. Singling out one ethnic group for screening risks stigmatising its members and reminds many of the horror of eugenics.

3. Certain disorders seem to cluster with CF. 'In 1936, Guido Fanconi published a paper describing a connection between celiac disease, cystic fibrosis of the pancreas, and bronchiectasis.' Ireland also has the highest rate of celiac disease in the world (about 1 in 100). If CF and celiac disease, or some other observable characteristic, are also correlated in Ireland, then testing people with celiac disease in their family could likewise amplify a test.

4. Screening parents undergoing IVF. HARI was the first clinic in Ireland to offer IVF and it currently receives up to 800 enquiries a year specifically about the procedure. It carries out over 1,350 cycles of IVF treatment annually and over 3,500 babies have been born as a result. The Merrion Clinic carries out up to 500 cycles of IVF per year, while last year SIMS carried out 1,063 cycles. IVF is roughly 33% effective per cycle, so these three Irish clinics produce about 1,000 children a year (roughly 2,900 cycles at a 33% success rate). Screening these parents would prevent roughly one CF case per year, so screening people who use IVF does not prevent many cases. It can, though, be used by people who know they are CF carriers to avoid having a child with CF.
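The rough arithmetic behind those estimates:

cycles <- 1350 + 500 + 1063   # HARI + Merrion + SIMS cycles per year
births <- cycles * 0.33       # ~33% success per cycle gives ~960 children
births / 1444                 # expected CF cases among them: ~0.7 per year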

Concerns about the privacy and security of a general genetic screening program of the Irish population should not be ignored. Cathal Garvey on Twitter pointed out that such screening would need to be done 'with explicit informed consent & ensuing destruction of samples', being 'wary of prior shenanigans of HSE bloodspot program. i.e. it's already fashionable among governments to abuse screening programs to create 'law' enforcement databases. Without clear guarantees against that, must weigh the costs of mass DNA false incriminations vs. gains of ntnl screening prog!' I agree that any genetic screening program for Ireland would have to ensure privacy for the individual.

Screening the general population for carriers of serious genetic disorders would save money and suffering. If the level of savings is not sufficient for general screening, then focusing on certain locations, or on relatives of people who suffer from disorders that co-occur with CF, could amplify the returns sufficiently to be as useful as current screenings.