Creating histogram from data in two columns

alisha.garibaldi · May 28, 2020, 5:21pm

Hi all - I'm hoping that someone can help me with this. I have a large dataset that I need to create a histogram of, but my data is in two columns. The first column (CO) is median income (the quantitative variable I want on my x axis), and the second column (CONum) is the count of individuals reporting that income. I have 1000 different incomes, and each one has a count of up to 15,000, so I can't transform this data manually.
Thank you!

startz · May 28, 2020, 5:40pm

This isn't as easy as one might think. A good option that takes a little work is described at https://stackoverflow.com/questions/6957549/overlaying-histograms-with-ggplot2-in-r.

An easier, but much less attractive, solution is
hist(col1, col = "red")
hist(col2, col = "blue", add = TRUE)
where the trick is add = TRUE in the second hist() call.

alisha.garibaldi · May 28, 2020, 9:07pm

Hey! Thanks for your response. This definitely helps with the end goal of what I need to do, but right now I essentially need both columns to form a single plot, not overlaying plots. I need each number in column 1 to repeat as many times as specified in column 2, and make a histogram out of the combined data.

mfherman · May 28, 2020, 9:19pm

There are a couple of approaches that could work here. The first is to use the uncount() function from the tidyr package, which takes data in the form you are describing and expands it so that each row is repeated as many times as its count.

library(tidyverse)

df <- tibble(
  median_inc = seq(1, 1000000, 1000),
  n = rpois(1000, 15)
)

df %>% 
  uncount(n) %>% 
  ggplot(aes(median_inc)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Another option is to just directly plot your data using geom_col(), since it sounds like you already have binned data.

df %>% 
  ggplot(aes(median_inc, n)) +
  geom_col()

Created on 2020-05-28 by the reprex package (v0.3.0)
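
A further possibility, not raised in the thread, is ggplot2's weight aesthetic, which bins the incomes as a true histogram without expanding the rows first. The following is only a minimal sketch: it assumes the counts sit in a column called n, as in the example data above, and the constant counts and bins = 30 are made-up values for illustration.

```r
library(ggplot2)

# Hypothetical data in the same shape as the example: one row per income plus its count
df <- data.frame(
  median_inc = seq(1, 1000000, 1000),
  n = rep(15, 1000)
)

# weight = n makes each row contribute n observations to whichever bin it falls in
p <- ggplot(df, aes(x = median_inc, weight = n)) +
  geom_histogram(bins = 30)

# Sanity check: the binned counts should add up to the total of the count column
binned <- ggplot_build(p)$data[[1]]
all.equal(sum(binned$count), sum(df$n))

print(p)
```

Because nothing is expanded, this stays fast even when the counts run into the thousands per row.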

alisha.garibaldi · May 28, 2020, 10:29pm

Hey - I need to be able to keep it as a histogram. Right now I'm trying to work on using the rep function on the data. I created a data.frame co.df. The first five rows look like this:

co.df
      CO CONum
1  59099  8239
2  65957 14420
3  75794 14964
4 101313 11176
5 140610  4282

I'm trying to use the following:
co.expanded <- co.df[rep(row.names(co.df), co.df$CONum),]
and I'm getting the error:
Error in rep(row.names(co.df), co.df$CONum) : invalid 'times' argument

Any idea what I'm doing wrong here? Thanks again for your help - very new to R over here.

mfherman · May 28, 2020, 10:50pm

Your code is working just fine for me! In fact, it's the equivalent of the uncount() function. See below for an example of both ways of "expanding" your count data frame.

library(tidyverse)

# set up the data
co.df <- tribble(
  ~CO, ~CONum,
  59099, 8239,
  65957, 14420,
  75794, 14964,
  101313, 11176,
  140610, 4282
  )

# using base r
co.expanded <- co.df[rep(row.names(co.df), co.df$CONum), ]

# using tidyr
co.expanded2 <- co.df %>% 
  uncount(CONum, .remove = FALSE)

# check to see if both data frames are the same
identical(co.expanded, co.expanded2)
#> [1] TRUE

# make a histogram
co.expanded %>% 
  ggplot(aes(CO)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2020-05-28 by the reprex package (v0.3.0)
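
Since the same rep() call runs cleanly on the five rows shown, the "invalid 'times' argument" error probably comes from values elsewhere in the full CONum column. rep() raises it when the counts contain NA or negative values, or text that cannot be converted to numbers. The thread does not establish the actual cause, so the sketch below uses hypothetical data with counts read in as text with thousands separators, plus one missing value, to show how one might check and clean the column.

```r
# Hypothetical data: counts stored as text with commas, and one missing count
co.df <- data.frame(
  CO    = c(59099, 65957, 75794, 101313),
  CONum = c("8,239", "14,420", NA, "11,176"),
  stringsAsFactors = FALSE
)

# Inspect the column type and look for missing values
str(co.df$CONum)
anyNA(co.df$CONum)

# Strip the commas, convert to numeric, and keep only rows with a usable count
co.df$CONum <- as.numeric(gsub(",", "", co.df$CONum))
co.clean <- co.df[!is.na(co.df$CONum) & co.df$CONum >= 0, ]

# Expand by row position; each row is repeated CONum times
co.expanded <- co.clean[rep(seq_len(nrow(co.clean)), co.clean$CONum), ]

# The expanded data should have one row per counted individual
nrow(co.expanded) == sum(co.clean$CONum)
```

If only the histogram is needed, expanding the single column with hist(rep(co.clean$CO, times = co.clean$CONum)) avoids building the full expanded data frame.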

startz · May 29, 2020, 2:10am

Look at weighted.hist() in the plotrix package.
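
A minimal sketch of that suggestion, assuming the plotrix package is installed and using the five rows posted earlier in the thread. The main and xlab values are illustrative choices, passed through on the assumption that weighted.hist() forwards extra arguments to the underlying plot.

```r
library(plotrix)

# The five rows shown earlier in the thread
co.df <- data.frame(
  CO    = c(59099, 65957, 75794, 101313, 140610),
  CONum = c(8239, 14420, 14964, 11176, 4282)
)

# x holds the values to bin, w the weight (count) attached to each value
weighted.hist(
  x = co.df$CO,
  w = co.df$CONum,
  main = "Median income, weighted by count",
  xlab = "Median income"
)
```

As with a weight aesthetic, no row expansion is needed, because each income contributes its count to the bin it falls in.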
