Wednesday, June 01, 2016

The Name of the Youngest Ever Modern Olympics Gold Medal Winner is Unknown

In the 1900 Olympics the Dutch rowing team were short a cox. They used a rower in the semifinal, Hermanus Brockmann, but decided his 60kg weight was too much of a handicap.

So the rowers, François Brandt and Roelof Klein, picked a ten-year-old French boy (25kg) out of the crowd and asked him to cox for them.

They won the gold. And took a photo with the boy. But his identity has never been established.

Thursday, May 19, 2016

Dying at Work in the US

Datasets from the Occupational Safety & Health Administration (OSHA) track workplace fatalities in the US. OSHA publicly releases CSV records of each year's workplace deaths in the US.

The data contains the date, location and a description for 4000 fatalities over five years. I created columns for state, zipcode, number of people and cause.

The most common interesting words in these descriptions are:

  • 813 fell
  • 708 struck
  • 642 truck
  • 452 falling
  • 382 crushed
  • 352 head
  • 263 roof
  • 261 tree
  • 258 electrocuted
  • 244 ladder
  • 238 vehicle
  • 226 trailer
  • 197 machine
  • 186 collapsed
  • 180 forklift

Not common but interesting

  • 10 lightning
  • 48 shot
  • 4 dog
  • 2 bees
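Counts like the ones above can be produced with a few lines of R. This is only a minimal sketch: the `descriptions` vector, the `stopwords` list and the `count_words` helper are made up for illustration and are not the code or data behind the numbers in this post.

```r
# Hypothetical example descriptions; the real OSHA text is not reproduced here.
descriptions <- c(
  "Worker fell from roof",
  "Worker struck by falling tree",
  "Worker fell from ladder and struck head"
)

# Assumed list of uninteresting words to drop before counting.
stopwords <- c("worker", "from", "by", "and", "the", "a", "of")

count_words <- function(texts, stopwords) {
  # Lower-case, split on anything that is not a letter, and flatten.
  words <- unlist(strsplit(tolower(texts), "[^a-z]+"))
  # Drop empty strings and stopwords.
  words <- words[nchar(words) > 0 & !(words %in% stopwords)]
  # Tabulate and sort with the most frequent word first.
  sort(table(words), decreasing = TRUE)
}

word_counts <- count_words(descriptions, stopwords)
print(word_counts)
```

Which words count as "interesting" is a judgment call, so the stopword list is the part you would tune by hand.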

And here is a map I made of the states where they happen.

I have created a repository to try to augment the OSHA data and clean it up when errors are found.

The repository is on GitHub here.

If you use it, I'll give you edit rights and you can help improve it.

Sunday, May 15, 2016

Handpicked by Amazon

Whenever I check some product on Amazon, for the next few days I get that product in the advertisements on Facebook.

Handpicked?

Why would Amazon lie like this?

Thursday, April 21, 2016

Can you Judge a Book by its Cover?

"they've all got the same covers, and I thought they were all o' one sample, as you may say. But it seems one mustn't judge by th' outside. This is a puzzlin' world." The Mill on the Floss by George Eliot
What is the correlation between people's ratings of a book's cover and the ratings the book receives? This post is about a game devised to get people to rate book covers, and it gives some great visualisations comparing a book's Goodreads rating to its cover rating. They gathered over 3 million ratings of 100 covers.

I took their data and got the average rating for each of the covers they tested. I then scraped these 100 books' Goodreads average ratings, number of ratings and number of reviews. The data table and the code I used to scrape and aggregate are here. There are all sorts of accuracy warnings you can imagine around these results. The main one is that the books and their covers all look pretty good to me. They are not on the self-published fan fiction end of the market.

The variables here are:

  • num_ratings: number of Goodreads ratings
  • rating: average rating of the book
  • num_reviews: number of people who have actually written a review
  • cover_rating: the average rating people gave the cover of the book

> cor(rating,cover_rating)

[1] 0.1609114

> cor(num_ratings,num_reviews)

[1] 0.9597442

> cor(rating,num_ratings)

[1] 0.2141307

> cor(rating,num_reviews)

[1] 0.2658916

> cor(num_ratings,cover_rating)

[1] 0.3059627

> cor(num_reviews,cover_rating)

[1] 0.3307553
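The pairwise calls above can also be run in one go by passing a whole data frame to `cor()`. A minimal sketch, assuming the four variables sit in a data frame; the `books` data frame and its numbers are made up for illustration and are not the scraped values.

```r
# Made-up numbers for illustration only; these are not the scraped values.
books <- data.frame(
  rating       = c(3.9, 4.1, 3.6, 4.3, 3.8),
  num_ratings  = c(12000, 54000, 3100, 98000, 20000),
  num_reviews  = c(900, 3500, 250, 6100, 1500),
  cover_rating = c(3.1, 3.4, 2.8, 3.9, 3.0)
)

# cor() on a data frame returns every pairwise Pearson correlation at once.
cor_matrix <- cor(books)
print(round(cor_matrix, 3))

# A single pair can still be read out by row and column name.
cor_matrix["rating", "cover_rating"]
```

The matrix is symmetric with ones on the diagonal, so each of the six pairs appears twice.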

So no, you can't judge a book by its cover: the correlation between cover rating and book rating is only 0.16. You can guess the number of ratings from the number of reviews. You can't guess how highly rated a book is from the number of ratings. Having a good cover might increase the number of reviews your book gets by a bit.

The conclusion is you shouldn't judge a book by its cover. Or by its number of sales (ratings). But people probably do judge books by their cover a bit.

Monday, March 07, 2016

Maps to hide places

Logaskino was a military base in Siberia. Over 30 years, Soviet mapmakers moved it around maps to throw off enemies. "How to Lie with Maps" talks about how the Soviets would move the location of military bases on maps. These maps show one small base (now abandoned) and the local river, and how they moved around on maps over 30 years in an attempt to confuse enemies.

Friday, January 22, 2016

England's Temperature in 2015

Nine days in 2015 were the hottest for that day of the year since 1772. This compares to three in 2014, though 2014 had a hotter average temperature and was the hottest year on record in the UK.

England has collected daily temperature data since 1772 in the Hadley Centre Central England Temperature (HadCET) dataset.

I downloaded this Hadley Centre dataset and followed this tutorial, which is based on an original graphic by Tufte.


Here the black line is the average temperature for each day last year. The dark line in the middle is the long-term average temperature for each day (95% confidence). The wider straw-coloured bands represent the highest and lowest average daily temperature recorded on that day since 1772. The red dots are the days in 2015 that were hotter than the same day in any other year since 1772.

Looking at the black line that represents last year's temperatures, it was the winter and autumn that were far above average. Instead of a scorching hot summer, most of the record hot days were in November and December. 2014 had the same pattern of a hot winter. No day in 2015 was the coldest for that date in the recorded period.
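Finding the record days comes down to comparing each day of the target year with the highest value seen on that calendar day in all earlier years. A rough sketch of that comparison, assuming a long table with one row per year and day; the `temps` data frame, its values and the `record_hot_days` helper are hypothetical and not taken from HadCET or the tutorial.

```r
# Hypothetical daily mean temperatures (deg C) for three calendar days.
temps <- data.frame(
  year = rep(c(2013, 2014, 2015), each = 3),
  day  = rep(1:3, times = 3),
  temp = c(4.0, 5.5, 3.0,
           6.0, 5.0, 2.5,
           6.5, 4.8, 3.2)
)

record_hot_days <- function(temps, target_year) {
  past    <- temps[temps$year < target_year, ]
  current <- temps[temps$year == target_year, ]
  # Highest temperature seen on each calendar day before the target year.
  past_max <- tapply(past$temp, past$day, max)
  # Keep the days where the target year beats every earlier year.
  is_record <- current$temp > past_max[as.character(current$day)]
  current$day[is_record]
}

record_hot_days(temps, 2015)
```

Flipping `max` to `min` and `>` to `<` would give the record cold days instead.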

Sunday, January 17, 2016

In 2100 there will be a kilometer-tall building

I was in the Burj Khalifa last week. It is very big. But when will some bigger building be built? I want to look at the building height trend to see what the trend line says. Taking the Wikipedia page on the tallest buildings, there are two eras shown: the religious era (1200-1901) and the skyscraper era. I put the data in a csv here.

The correlation over the whole period is cor(Year,Height) [1] 0.39831, which isn't much. Basically, from 1200 until 1900 cathedrals burned down and were replaced as world's tallest building by a similar-sized one.

Looking just at the skyscraper era, 1884 on, cor(Year,Height) [1] 0.9340458, which really looks like height increases with time. Running this as a linear regression, the kilometer-tall building is not expected until the end of the century.

linearModelVar <- lm(Height ~ Year, newdata)

linearModelVar$coefficients[[2]]*2010+linearModelVar$coefficients[[1]]

646.6246: the Burj Khalifa was much taller than the model expected any building to be in 2010.

linearModelVar$coefficients[[2]]*2099+linearModelVar$coefficients[[1]]

1002.799: finally a kilometer-tall building in 2099.

linearModelVar$coefficients[[2]]*2241+linearModelVar$coefficients[[1]]

1604.903: a mile-high tower in 2241, far into the future?
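Instead of plugging in years by hand, the fitted line can be inverted to ask directly when a given height is reached. A minimal sketch under stated assumptions: the `buildings` data frame holds made-up points rather than the Wikipedia data, and `year_for_height` is a hypothetical helper, so the numbers it prints will not match the ones above.

```r
# Hypothetical (Year, Height in metres) points, not the Wikipedia data.
buildings <- data.frame(
  Year   = c(1900, 1930, 1970, 2000, 2010),
  Height = c(100, 300, 420, 500, 650)
)

fit <- lm(Height ~ Year, data = buildings)

# predict() gives the same result as intercept + slope * year.
predict(fit, newdata = data.frame(Year = 2099))

# Invert the fitted line to find the year at which a height is reached.
year_for_height <- function(fit, height) {
  intercept <- coef(fit)[["(Intercept)"]]
  slope     <- coef(fit)[["Year"]]
  (height - intercept) / slope
}

year_for_height(fit, 1000)
```

This only makes sense while the slope is positive and the straight-line trend holds, which is a big assumption a century out.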