Approximately Normal

eBay Scam?

2020-01-11T00:00:00+00:00

Last night I fell down the YouTube rabbit hole and ended up watching this video. It shows a coin collector who purchased 5 “random” grab bags from an eBay seller. The listing claims the buyer gets a random coin from a lot of 74 total coins. Also, of the 74 coins, 43 of them contain 1 oz of silver (these are worth more). It turned out that the YouTuber only ended up with 1 of the more valuable 1 oz silver coins out of the 5 coins he ordered. That outcome seemed unlikely to him so he had some concerns about whether the eBay seller was running a scam.

Curious about that outcome as well, I decided to calculate the probability of getting exactly 1 of the more valuable coins when ordering 5 total grab bags. For simplicity, I will assume the seller hasn’t sold any grab bags yet so the total number of coins available is still 74. The number of 1 oz silver coins among 5 coins drawn without replacement follows a hypergeometric distribution. The probability can be calculated in R like so:

dhyper(1,43,31,5)

## [1] 0.08399124
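As a cross-check, the same number can be worked out from the counting argument behind the hypergeometric distribution. This is only a sketch under the same assumption as above (all 74 coins still available); the name p_by_hand is made up for illustration.

```r
# Ways to pick 1 of the 43 silver coins and 4 of the 31 other coins,
# divided by all the ways to pick 5 of the 74 coins
p_by_hand <- choose(43, 1) * choose(31, 4) / choose(74, 5)

# Should agree with the dhyper() call
all.equal(p_by_hand, dhyper(1, 43, 31, 5))
```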

If we didn’t know this was a hypergeometric distribution, then we could also estimate this probability using a simple simulation:

set.seed(111)

coins <- c(rep("1oz",43), rep("not_1oz",31))

success <- NULL

for(i in 1:10000){
  bag <- sample(coins, 5, replace = FALSE)
  success[i] <- ifelse(length(which(bag == "1oz")) == 1, 1, 0)
}

mean(success)

## [1] 0.0853

Does an 8% chance indicate a scam?
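One way to frame that question is to ask how likely a result at least this bad would be, rather than exactly this result. The sketch below keeps the same assumption of a full lot of 74 coins; the variable names are only illustrative.

```r
# Probability of 1 or fewer silver coins in 5 draws (lower tail)
p_at_most_one <- phyper(1, 43, 31, 5)

# Full distribution of the number of silver coins, from 0 to 5
probs <- dhyper(0:5, 43, 31, 5)
names(probs) <- 0:5
round(probs, 4)

# Expected number of silver coins in 5 draws
expected_silver <- 5 * 43 / 74
```

Comparing p_at_most_one with the rest of the distribution shows how far the YouTuber's result sits from the typical outcome, although a single order of 5 coins is not much evidence either way.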

forestFloor Visualization

2019-06-13T00:00:00+00:00

A collaborator and I recently submitted a manuscript analyzing metabolomic data. Part of the analysis involved building a random forest model for classifying patient disease status using metabolite concentrations. The journal we submitted to required an image for a graphical table of contents. I decided this would be a good opportunity to show off the cool things that can be done using the forestFloor package in R.

Fantasy Football Player Rankings

2018-11-22T00:00:00+00:00

It’s Thanksgiving, and for many, that means sitting around with the family watching football after eating lots of turkey. It also means the fantasy football season is just a few weeks away from entering the playoffs. In previous years, that wouldn’t be something that I had much interest in because my fantasy teams have been terrible. This year, however, I am doing…mediocre. Despite the fact that I am currently poised to make the playoffs for once, I don’t feel like my draft or management styles have changed any since finishing last and then second to last in the two prior seasons. I also feel like I had drafted my teams the “right” way according to the experts, and that really made me wonder about the accuracy of fantasy football preseason player rankings.

With that in mind, at the start of this fantasy football season I decided to take a look at how good of a job the professional prognosticators do at predicting fantasy football performance. I found ESPN’s 2017 preseason fantasy player rankings which included opinions from five of their experts. I elected to use the PPR rankings since our league started to use that scoring system this year. For results at the end of the season, I took data from FantasyPros, instead of ESPN, because it was easier to access their data and the points and rankings seemed to match on both sites.

I had to perform a little data cleaning which mostly centered around how the sites handled apostrophes in names differently. I also decided to remove defenses and players who did not play in five or more games because I do not think that anyone’s rankings should be punished by injuries. The same should be said for all the poor people who drafted David Johnson with the #1 overall pick last year (like me).

Once that was done, I was ready to proceed with the analysis. I mainly wanted to examine the correlation between preseason and end-of-season player rankings. In order to do that, I calculated Spearman’s correlation and also created a scatterplot to visualize any relationship. The correlation between the two ranks was about 0.38, and you can see what the scatterplot looked like below. Note that I switched up both axes so higher ranked and better performing players should be in the top right.
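For anyone who wants to try the same kind of comparison, here is a minimal sketch. The six pairs of ranks are made up for illustration and are not the ESPN or FantasyPros data, and the column names are assumptions.

```r
# Hypothetical ranks for six players (not the real data)
ranks <- data.frame(
  preseason = c(1, 2, 3, 4, 5, 6),
  final     = c(3, 1, 6, 2, 4, 5)
)

# Spearman's correlation between the two sets of ranks
rho <- cor(ranks$preseason, ranks$final, method = "spearman")

# Reverse both axes so rank 1 (the best) lands in the top right
plot(ranks$preseason, ranks$final,
     xlim = rev(range(ranks$preseason)),
     ylim = rev(range(ranks$final)),
     xlab = "Preseason rank", ylab = "End-of-season rank")
```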

The correlation is not very strong, and the scatterplot seems to back that up. There are a number of players who were significantly under-ranked as well as several over-ranked (some due to injury and some not). I think the main takeaway is that projecting individual performance in a sport like football is hard, and as an extremely unsuccessful fantasy owner, that makes me feel a little better.

As usual, the code and data can be found on my Github.

Sgt. Pepper’s Word Cloud

2018-08-05T00:00:00+00:00

As my last post hinted at, I’ve been working on scraping Beatles lyrics from the web. I was never really able to fully automate the process since the links didn’t have an easily repeatable pattern - the lyrics for each song were on a page with that song in the address. Nevertheless, I was able to complete the collection of lyrics from the “core” albums without too much hassle. Everything can be found in my beatles-lyrics repo on Github.

I have several analyses planned using this data. To get things started, I thought I’d create a simple word cloud for Sgt. Pepper’s Lonely Hearts Club Band. The code is really simple and I’m using the tm and wordcloud packages. I have the songs for each album in their own folder, and all of the album folders are in a folder called “Beatles” that is in my usual R working directory.

First, load the packages and set the working directory:

library(tm)
library(wordcloud)

setwd("Beatles/")

Next, load the data and create a corpus from the Sgt. Pepper’s lyrics before doing some basic processing:

sgtp <- VCorpus(DirSource("SgtPeppers/"))

sgtp <- tm_map(sgtp, removePunctuation)
sgtp <- tm_map(sgtp, content_transformer(tolower))
sgtp <- tm_map(sgtp, removeNumbers)
sgtp <- tm_map(sgtp, stripWhitespace)
sgtp <- tm_map(sgtp, removeWords, stopwords("english")) 

I did some transformations like: removing punctuation, removing any numbers, and converting to lower case. Importantly, I also removed “stop words” (words like “the”), but I did not do any stemming.
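To make those steps concrete, here is a rough base R sketch of what each transformation does to a single line. The example line and the tiny stop list are made up for illustration; the real tm functions and stopwords("english") list are more thorough.

```r
# Made-up example line and a tiny hypothetical stop list
line  <- "Will you still need me, when I'm 64?"
stops <- c("will", "you", "me", "when", "i'm")

clean <- gsub("[[:punct:]]", "", line)    # remove punctuation
clean <- tolower(clean)                   # convert to lower case
clean <- gsub("[[:digit:]]+", "", clean)  # remove numbers

# Split on whitespace and drop the stop words
words <- strsplit(trimws(clean), "\\s+")[[1]]
words <- words[!words %in% stops]
words
```

In this sketch "im" survives because the apostrophe was stripped before the stop words were removed, so it no longer matches "i'm" in the stop list. The order of the steps can matter for contractions.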

From there, I created the document-term matrix, and only a few more lines of code were needed to produce the word cloud. Note that I am only including words with a minimum frequency of five.

dtm <- DocumentTermMatrix(sgtp)

freq <- colSums(as.matrix(dtm))

set.seed(1)
wordcloud(names(freq), freq, min.freq = 5, colors = brewer.pal(3, "Accent"))

The final result can be seen here:

There should be some more cool stuff coming from this. I know I want to do a sentiment analysis, and I also plan on trying out the tidytext package. Also, I already have at least one Shiny app planned for this data. Stay tuned!

Quirky rvest Behavior

2018-07-25T00:00:00+00:00

I’ve recently started a new project which involves scraping lyrics from the web. As before with the World Cup data, I turned to rvest to obtain the data. However, this time I noticed some weird behavior with the results.

Here is what happened:

    library(rvest)

    ppm <- read_html("http://lyrics.wikia.com/wiki/The_Beatles:Please_Please_Me")

    ppm %>% html_node('.lyricbox') %>% html_text()

    ## [1] "Last night I said these words to my girlI know you never even try, girlCome on (come on), come on (come on)Come on (come on), come on (come on)Please please me, whoa, yeahLike I please youYou don't need me to show the way, loveWhy do I always have to say \"love\"?Come on (come on), come on (come on)Come on (come on), come on (come on)Please please me, whoa, yeahLike I please youI don't wanna sound complainingBut you know there's always rain in my heart (in my heart)I do all the pleasing with youIt's so hard to reason With you, whoa yeahWhy do you make me blue?Last night I said these words to my girlI know you never even try, girlCome on (come on), come on (come on)Come on (come on), come on (come on)Please please me, whoa, yeahLike I please you(Please) me, whoa, yeahLike I please you(Please) me, whoa, yeahLike I please you\n"

As you can see, some of the words are concatenated. This will cause serious problems for a text analysis since “girlI” is not an actual word. After going back to inspect the source, it seems like this occurs whenever there is a new line.

I did some Googling to see if there was an easy way to handle this. Apparently, a few other people have encountered this same issue and posted about it on Github, but no fix appears to have been posted there.

It seems like this should be a simple matter to fix, but handling strings is one of my weaker programming skills. Therefore, I turned to the r/rstats subreddit for help, and fortunately u/Schrodingers-Human posted a nice solution that seems to work!

Here is my slightly modified solution taken from Schrodingers-Human:

    library(rvest)
    library(stringr)

    ppm <- read_html("http://lyrics.wikia.com/wiki/The_Beatles:Please_Please_Me")

    ppm %>%
        html_node('.lyricbox') %>%
        as.character() %>%
        str_sub(start=23, end=-39) %>%
        str_replace_all("<br>", " ")

    ## [1] "Last night I said these words to my girl I know you never even try, girl Come on (come on), come on (come on) Come on (come on), come on (come on) Please please me, whoa, yeah Like I please you  You don't need me to show the way, love Why do I always have to say \"love\"? Come on (come on), come on (come on) Come on (come on), come on (come on) Please please me, whoa, yeah Like I please you  I don't wanna sound complaining But you know there's always rain in my heart (in my heart) I do all the pleasing with you It's so hard to reason With you, whoa yeah Why do you make me blue?  Last night I said these words to my girl I know you never even try, girl Come on (come on), come on (come on) Come on (come on), come on (come on) Please please me, whoa, yeah Like I please you (Please) me, whoa, yeah Like I please you (Please) me, whoa, yeah Like I please you"

Looks much better.
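Another option that may be worth trying, assuming rvest 1.0.0 or later is installed, is html_text2(), which turns br tags into line breaks instead of dropping them. The sketch below uses a toy HTML snippet in place of the real lyrics page, so it has not been checked against the actual site.

```r
library(rvest)

# Toy HTML standing in for the lyrics page (not the real page source)
page <- read_html("<div class='lyricbox'>to my girl<br>I know you never even try</div>")

lyrics <- page %>%
    html_node('.lyricbox') %>%
    html_text2()

# Swap the line breaks for spaces so each word stays separate
lyrics <- gsub("\n", " ", lyrics, fixed = TRUE)
lyrics
```

This avoids the hard-coded start and end positions in str_sub(), which depend on the exact markup around the lyrics.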

World Cup Shiny App

2018-07-23T00:00:00+00:00

The data for World Cup goal times caused me a bit of a headache, and I was happy to get everything sorted out so I could start exploring something new. Nevertheless, an idea for one last visualization using this data has been stuck in my head.

My favorite plot to come out of the original analysis was the distribution of goal times broken down by year. However, there were so many years that I was never able to get the dimensions of the plot to look quite right. The histograms always seemed distorted in some way. To fix this, I decided to try an interactive Shiny app where one could select individual years to visualize.

The visualization I have in mind is really simple and shouldn’t take too much coding. And yes, that is exactly what I incorrectly assumed when I initially scraped the data. Fortunately, things went better this time.

First off, I need to load the data and necessary packages. Since this app is going to be very simple, two packages will do:

library(shiny)
library(ggplot2)

# data frame is called new.goals
load("updated_goals.RData")

Shiny apps have two main pieces: the user interface definitions in ui and the server function. I want my interface to have a title panel, year input selection on the side, and plot in the main panel. I can get that with:

ui <- fluidPage(

  headerPanel("When World Cup Goals Are Scored"),
  
  sidebarPanel(
    selectInput("var", label = "Select year:", unique(new.goals$year))
  ),
  
  mainPanel(
    plotOutput("hist")
  )
)

Then, my server function takes the user-selected year and subsets the data to produce a histogram in ggplot2.

server <- function(input, output) {
   
  selectedYear <- reactive({
    subset(new.goals, year == input$var)
  })
  
  output$hist <- renderPlot({
    ggplot(data = selectedYear(), aes(x = time)) + geom_histogram(breaks = seq(0, 125, 5), color = 'black', fill = 'steelblue3') +
      labs(x = 'Game Time', y = 'Count')
  })
  
}

Finally, running shinyApp(ui = ui, server = server) will launch the app locally.

This app is currently up and running on my shinyapps.io page. Shiny is super cool, and a little different than normal R use, so I imagine more Shiny posts will come in the future.

Visualizing When World Cup Goals Are Scored

2018-07-17T00:00:00+00:00

A surprising amount of hard work went into obtaining the simple dataset containing the game times when World Cup goals are scored, and now it is time to see if that work will pay off. Fortunately, the actual analysis I want to do is very simple and it shouldn’t take much effort using ggplot2.

Before plotting, a little data cleaning/processing is needed. The raw data has each time as a string, and stoppage time goals look something like: 45’+2’. I saw a few options for handling the stoppage time goals. First, I could treat the goal times more like categorical data and just have something like a “45+” group. This, however, loses a little bit of information since one can’t see exactly when the goals are scored. My next thought was to just add the times together. So 45’+2’ would become 47’. The problem with this is that a goal scored in the second minute of stoppage time during the first half and a goal scored 2 minutes into the second half would be treated the same. I ended up going with the latter option and chose to create a new variable to indicate whether a goal was scored in the first half, second half, or extra time. This would help fix the problem created by adding up the stoppage time goals. I imagine there are several more ways to address this issue, and some might prove superior to what I chose to do, but this seemed like a good solution given I just wanted to get a quick sense of what is going on with the data.
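The idea can be sketched roughly as follows. This is not the cleaning code from the repo; the function name parse_goal_time and the example strings are made up, and it assumes every time looks like 12' or 45'+2'.

```r
# Hypothetical helper: turn strings such as "45'+2'" into a minute and a half
parse_goal_time <- function(x) {
  # Pull out every run of digits, e.g. "45'+2'" gives "45" and "2"
  parts <- regmatches(x, gregexpr("[0-9]+", x))
  base  <- vapply(parts, function(p) as.numeric(p[1]), numeric(1))
  total <- vapply(parts, function(p) sum(as.numeric(p)), numeric(1))
  # The half comes from the minute before any stoppage time is added
  half <- cut(base, breaks = c(0, 45, 90, Inf),
              labels = c("first", "second", "extra"))
  data.frame(raw = x, minute = total, half = half, stringsAsFactors = FALSE)
}

goals <- parse_goal_time(c("12'", "45'+2'", "47'", "90'+3'", "105'"))
goals
```

Here 45'+2' and 47' both end up at minute 47, but the half column keeps them apart.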

Below is the first plot showing a histogram of all of the times when goals are scored: The distribution does not quite seem to be uniform. Interestingly, to me it looks like there is a little spike towards the end of the second half. I sort of expected this as teams might be motivated to press harder to get a goal to win or tie the game.

Next, I created a plot looking at the goal times broken down by “half” (so first, second, or extra time). Those histograms are below:

Finally, I wanted to see if the distribution looked different for different World Cups. I created separate histograms for each year in order to investigate that question. That series of plots is here: The distribution seems more uniform in some years, and some years have some unique characteristics. For example, what happened to boost the scoring right around half time in 1998?

The cleaned data with corresponding code, as well as the code used to create all of the histograms (with a couple bonus plots), can be found on my Github.

P.S. At some point during all of this I found an excellent World Cup dataset on Kaggle that includes some other useful information. You can check that out here.

Scraping World Cup Goal Data

2018-07-03T00:00:00+00:00

Watching the World Cup over the last couple of weeks has got me wondering several things. Is the stoppage time determination completely arbitrary? What is with that magical medical spray? Will Neymar ever walk the same after that vicious injury?

Several of these questions are difficult to answer, but one potential thing I could investigate is the distribution of when goals are scored during a game. I’m interested in things like comparing the frequency of goals scored in each half and seeing if there is an increase in goals near the end of the game. As far as the statistics go, my curiosity could be satisfied with some simple plots so finding the right data is probably the only challenging thing I will need to do.

A quick Google search did not yield the exact data that I needed so I would have to compile everything myself. This wasn’t a deterrent, however, because it presented an excellent opportunity to refresh my web scraping skills in R. I decided to scrape the data from Wikipedia using the rvest package in coordination with SelectorGadget.

Here is the bit of code that got me started:

library(rvest)

# World Cups every four years from 1930, skipping 1942 and 1946
years <- setdiff(seq(1930, 2014, 4), c(1942, 1946))
times <- NULL
which.cup <- NULL

for(i in seq_along(years)){
	# paste0() avoids the spaces that paste() would insert into the URL
	page <- paste0("https://en.wikipedia.org/wiki/", years[i], "_FIFA_World_Cup")
	temp.page <- read_html(page)
	# selector for the goal times, found with SelectorGadget
	nodes <- html_nodes(temp.page, '.mw-parser-output div td small')
	output <- html_text(nodes)
	# keep the times and record which World Cup each one came from
	times <- c(times, output)
	which.cup <- c(which.cup, rep(years[i], length(output)))
}

At this point, things looked good. I manually checked all the times for the 1930 World Cup and everything matched. Next, I created a table displaying the total number of goals scored for each World Cup. Curiously, my results indicated that there were only 17 total goals scored in 1982, and there was no way that was right. For some reason, starting in the 1970s, the main Wikipedia entries for each World Cup stop displaying the exact times goals are scored during group play. That means I needed to get additional data from each individual group page to combine with what I already had for the knockout portion of each tournament.
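A sanity check like the one described above can be as simple as tabulating which.cup. The sketch below uses a made-up which.cup vector rather than the scraped data, just to show the shape of the check.

```r
# Hypothetical stand-in for the scraped vector (one entry per goal time)
which.cup <- c(rep(1930, 3), rep(1982, 1), rep(2014, 4))

# Number of goal times collected for each World Cup
goals.per.cup <- table(which.cup)
goals.per.cup
```

A year with a suspiciously small count is a sign that the selector missed part of that tournament.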

That was frustrating, but it seemed like the solution would be fairly straightforward. Unfortunately, here is where things really hit the fan. So today, World Cup groups are identified by letters, but they were previously numbered. The size of the tournament has changed over the years so the number of groups has changed as well. Also, in 1982 there were two rounds of group play!

This inconsistency in the tournament format meant I had to deal with several special cases. I ended up using an inelegant “brute force” approach to writing the code for the remaining group stage data, and I am fairly confident that a more efficient solution is out there (possibly using a different data source). Nevertheless, in the end I had the data I wanted. The raw data from Wikipedia as well as the R script that I used to obtain the data can be found on Github here.

That First Blog Post About The Blog

2018-06-28T00:00:00+00:00

To get things started off, I thought I would explain some of the details about how this site came into existence. I can’t remember coming across any particular site that really served as motivation, but somehow, I read about GitHub Pages and Jekyll. I don’t really have any experience with website creation and design so I had to rely on Google and the numerous resources that others have graciously shared online to figure out how to get things to work.

Starting with the basics, I found this YouTube video which appears to be a recording of a workshop from the University of Idaho Library:

After watching most of that video and looking over the accompanying website, I decided I would take a learn-by-doing approach and start working on customizing my own site. First off, I wanted to pick a slick theme. None of the built-in Jekyll themes available straight from GitHub Pages did too much for me so I did some searching for alternatives. That’s when I came across Minimal Mistakes and the corresponding (and super helpful) Quick-Start Guide. There is a Minimal Mistakes template available on GitHub that I forked and started editing. And that brings us to here…