Best plot option for mix of categorical/numerical data

Nemlock · September 3, 2020, 12:15pm

Dear RStudio community,

I would like to ask a question I have been dealing with for the past several days.
However, I still have not found a good solution for displaying what I need in a proper, solid way. Hence, I thought I would come to you, the experts, to potentially help me out on this.

The problem:
I have data of boreholes (txt file, rows & columns) with 6 columns in total. The first column holds the name of a borehole, the second its depth, and the third through the last indicate whether the borehole contains the data named in the column header (e.g. data1: yes/no, data2: yes/no, etc. up to data4).

My approach so far:
I did some research and found functions such as balloonplot (part of ggplot) to e.g. display the four categorical data types (data1 to data4) and the boreholes' names on the Y-axis, displaying bubbles, which have a size related to their depth (continuous color scale). However, of course, this does not work as it mixes categorical (yes/no) data with continuous data (depth in meters, for colour code). Furthermore, if no data is available, how should the bubble be plotted if it is related to depth? But I still want to include the depth information in the same plot.
I was thinking of a bunch of other options but could not find a proper solution. Maybe I am just making my life too complicated with that...?

Therefore, I would like to ask whether you could give me a hint if I am on the right track with the balloonplot option or if I should consider a completely different type of plot. However, I would like to avoid standard bar plots.
The number of boreholes, hence rows of my file, is 70 (which makes the X- or Y-axis quite long).

Thank you very much for your consideration in advance and have a nice day!

nirgrahamuk · September 3, 2020, 12:50pm

What information do you want to communicate about the boreholes?
As a reader, what would I want to understand after looking at your plot?

jrkrideau · September 3, 2020, 1:02pm

Hi,
Great description of the data layout, though some sample data would be handy. See FAQ: How to do a minimal reproducible example ( reprex ) for beginners for some suggestions on how to provide some. Perhaps just a sample of the data set in dput() format is all that is needed. It saves us mocking up a data set, and it is better to work with real data.

The real issue here is what substantive question are you asking, in general terms? What the question is determines the choice of display. It sounds like you want to display 5 pieces of information per borehole at a single point on a graph, one numeric and 4 categorical.

I think it can be done in theory but I am not sure how easily. Would it be feasible to display the data in multiple panels?

Nemlock · September 3, 2020, 1:03pm

I would want the reader to know the name of the borehole (as it will be referenced in the text several times), whether it contains data1, data2, data3, data4 or a combination of them, and how deep each borehole is, while still showing the type of data (data1, data2, etc.).

jrkrideau · September 3, 2020, 1:06pm

Would a table layout, like this https://www.bing.com/images/search?view=detailV2&ccid=ljYu%2BlxL&id=1E1CFCAE12D63FCC5066B3D9ACECAEB425378888&thid=OIP.ljYu-lxLk8eeIVSZ2lFE_wHaE8&mediaurl=https%3A%2F%2Fwinvector.files.wordpress.com%2F2020%2F06%2F8feb8-plot3_2.png%3Fw%3D656&exph=400&expw=600&q=plot+multivariate+data++Cleveland+elements+of+graphing&simid=608030900958921019&ck=288EAA9D96AC65A4E842C376A3B13777&selectedIndex=4&FORM=IRPRST&ajaxhist=0

suitably jazzed up to deal with the categorical nature of the data do any good?

Nemlock · September 3, 2020, 1:10pm

Thank you for your answer! I was just writing the other reply, so I am coming back to you now.
You are absolutely right.
Please find below a short excerpt of the data. The RStudio file is packed with lots of trial and error from my side (probably most of it is bull■■■■).

I was already thinking of using 0 and 1 instead of "yes" or "no" for the type of data (data1 to data4). With this, I could keep data purely numerical. However, I do not want R to interpret it as 0 or 100 % (or minimum/maximum) but simply: is it available (or not).

This is the TXT file I am working with (structure-wise):

Well_name	Depth_m	Chemical	Mineralogical	Geotechnical	Petrophysical
XYZNAME	66.4	no	no	yes	no
ABCNAME	62	no	no	no	no
FFGNAME	30.5	no	yes	yes	no

EDIT: interestingly enough, RStudio now always plots a i..COLUMNAME in front of my very first column header, whereas the "i" contains two dots on top!?
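
One possible explanation for the "ï.." prefix (an assumption, the thread does not confirm it) is that the TXT file was saved with a UTF-8 byte-order mark (BOM), which R then reads as part of the first column header. The sketch below reproduces that situation with a temporary file whose contents are purely illustrative, and reads it with the BOM declared in fileEncoding:

```r
# Write a small tab-separated file that starts with a UTF-8 byte-order mark (BOM)
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Well_name\tDepth_m\tChemical\nXYZNAME\t66.4\tno\n"), con)
close(con)

# Declaring the BOM in fileEncoding strips it from the first header
wells <- read.delim(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
names(wells)
```

If the real file has no BOM, this setting should simply have no effect, and the cause lies elsewhere.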

Nemlock · September 3, 2020, 1:12pm

That might work.
But how do you replace the fraction values on the bottom? With depth values, for instance?
I guess a black spot is then drawn, when there exists data.

However, it would make the plot quite long having 70 boreholes in mind, I guess!?

nirgrahamuk · September 3, 2020, 1:43pm

With this method a user can read off a borehole's values by hovering over the bar.

library(tidyverse)

set.seed(42)
(boreholes <- tibble(
  id = c(letters,LETTERS),
  depth = rnorm(52,mean=100,20),
  Chemical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Mineralogical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Geotechnical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Petrophysical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE)
))

#custom color mapping
col_by_val <- function(x){
  case_when(x=="Yes" ~1,
            x=="No" ~ .5,
            TRUE ~ 0.1)
}

boreholes_with_desc <- mutate(boreholes,
                              description = paste0("ID: ",id,
                                     "\nDepth: ",round(depth,2),
                                     "\nChemical: ",Chemical,
                                     "\nMineralogical: ",Mineralogical,
                                     "\nGeotechnical: ",Geotechnical,
                                     "\nPetrophysical: ",Petrophysical),
                              key = rgb(col_by_val(Chemical),
                                        col_by_val(Mineralogical),
                                        col_by_val(Geotechnical),
                                        col_by_val(Petrophysical)
                              )) %>% arrange(key) %>% mutate(id=forcats::as_factor(id))

library(plotly)

plot_ly(data=boreholes_with_desc,
        type="bar",
        x=~id,
        y=~depth,
        hoverinfo="text",
        text=~description,
        marker = list(color=~key),
        stroke =I("black"))

jrkrideau · September 3, 2020, 1:47pm

I have never used such a graph, I just ran into it the other day, so I have no idea how difficult it is to use.

About the fractions on the x-axis, that is easy enough. You have binary data. Yes/No is the same as 1/0. We just convert the D1-4 columns to numeric. Presumably one could colour-code/symbol-code them as well.
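
A minimal sketch of that conversion in base R, using the three rows posted earlier in the thread. The names wells, data_cols and n_types are only illustrative:

```r
# Illustrative rows in the layout posted earlier in the thread
wells <- data.frame(
  Well_name     = c("XYZNAME", "ABCNAME", "FFGNAME"),
  Depth_m       = c(66.4, 62, 30.5),
  Chemical      = c("no", "no", "no"),
  Mineralogical = c("no", "no", "yes"),
  Geotechnical  = c("yes", "no", "yes"),
  Petrophysical = c("no", "no", "no"),
  stringsAsFactors = FALSE
)

data_cols <- c("Chemical", "Mineralogical", "Geotechnical", "Petrophysical")

# "yes" becomes 1 and "no" becomes 0; anything else (e.g. a typo) becomes NA
wells[data_cols] <- lapply(wells[data_cols], function(x) {
  unname(c(yes = 1L, no = 0L)[tolower(x)])
})

# Number of data types available per borehole
wells$n_types <- rowSums(wells[data_cols])
wells
```

The 0/1 values here only flag availability; nothing in the conversion forces a plot to treat them as a minimum and maximum.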

My guess is that we are stuck with dots for the depth values, so the reader can see the relative size of the holes but not the actual value.

Re length. What about adding a variable for the length that allows you to facet the data? Split the data into the first 30 and last 30 for example and present 2 panels?
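
The faceting idea could look roughly like the sketch below. The 70 boreholes are simulated stand-ins, and the panel column and the 35-per-panel split are assumptions chosen for illustration:

```r
library(ggplot2)

# Hypothetical stand-in for the 70 boreholes
set.seed(1)
wells <- data.frame(
  Well_name = sprintf("BH%02d", 1:70),
  Depth_m   = round(runif(70, min = 20, max = 120), 1)
)

# Assign each borehole to a panel of at most 35 rows
per_panel <- 35
wells$panel <- paste("Panel", ceiling(seq_len(nrow(wells)) / per_panel))

# One dot per borehole, with each panel showing only its own boreholes
p <- ggplot(wells, aes(x = Depth_m, y = Well_name)) +
  geom_point() +
  facet_wrap(~ panel, scales = "free_y") +
  labs(x = "Depth (m)", y = NULL)
p
```

With scales = "free_y" each panel lists only its own borehole names, so the two halves sit side by side instead of forming one very long axis.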

Nemlock · September 3, 2020, 1:55pm

Thank you very much for your help - it worked!

However, this is for a plot (on paper) and not meant to be used solely on a computer, where you could hover over the plot with a mouse (sorry for pointing that out just now).
Is there a chance to take the result of your code and transform it to something for a paper plot?

nirgrahamuk · September 3, 2020, 2:37pm

If you want to plot it on paper, you will probably need 4xA4 sheets.

library(tidyverse)
library(ggrepel)
set.seed(42)
(boreholes <- tibble(
  id = c(letters,LETTERS),
  depth = rnorm(52,mean=100,20),
  Chemical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Mineralogical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Geotechnical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Petrophysical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE)
))

#custom color mapping
col_by_val <- function(x){
  case_when(x=="Yes" ~1,
            x=="No" ~ .5,
            TRUE ~ 0.1)
}

boreholes_with_desc <- mutate(boreholes,
                              description = paste0("ID: ",id,
                                                   "\nDepth: ",round(depth,2),
                                                   "\nChemical: ",Chemical,
                                                   "\nMineralogical: ",Mineralogical,
                                                   "\nGeotechnical: ",Geotechnical,
                                                   "\nPetrophysical: ",Petrophysical),
                              key = rgb(col_by_val(Chemical),
                                        col_by_val(Mineralogical),
                                        col_by_val(Geotechnical),
                                        col_by_val(Petrophysical)),
                              textkey= paste0("Chemical: ",Chemical,
                                              "\nMineralogical: ",Mineralogical,
                                              "\nGeotechnical: ",Geotechnical,
                                              "\nPetrophysical: ",Petrophysical)
) %>% arrange(key) %>% mutate(id=forcats::as_factor(id))


colkey <- select(boreholes_with_desc,
                 textkey,key) %>% distinct()
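# Write the plot to a PNG file; print() makes sure the plot is drawn
# even when the script is run with source()
png(filename="massivebarplot.png",
    width = 4200,
    height = 700,
    units = "px")
print(
  ggplot(data=boreholes_with_desc,
         mapping = aes(x=id,y=depth,label=description,
                       fill=textkey)) +
    geom_col(colour="black") +
    # name the colours so each one is matched to its own legend entry
    scale_fill_manual(values=setNames(colkey$key, colkey$textkey)) +
    geom_label_repel(size=4,force=20,direction="y")
)
dev.off()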

Nemlock · September 3, 2020, 2:45pm

You are probably right. ^^

However, would it be an option to group the mineralogy, geochemical etc. data according to a colour and then add a legend to the plot according to the borehole, and each bar is then colored with 4 quarters, each representing a portion of the data, substantiated by the colour code of each data type.

Hence, one bar would be e.g. 1 color if it contains only one parameter (e.g. mineralogy data) but if a borehole contains e.g. 3 data types, the bar would split up to 33.33% each, whereas each area would be coloured with the colour of the representative data type?

Not sure, whether that is not even too complicated... !?

nirgrahamuk · September 3, 2020, 2:59pm

library(tidyverse)
library(ggrepel)
set.seed(42)
(boreholes <- tibble(
  id = c(letters,LETTERS),
  depth = rnorm(52,mean=100,20),
  Chemical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Mineralogical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Geotechnical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Petrophysical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE)
))


(blong <- pivot_longer(data=boreholes,
                       cols=3:6,
                       names_to="Type",
                       values_to="Presence") %>%
    # filter(!is.na(Presence)) %>% 
    group_by(id) %>%
  mutate(count=n(),
         portion = 1/n(),
         stackdepth=portion*depth,
         `Colour Key` = paste0(Type,": ",Presence)))

ggplot(data=blong,
       mapping = aes(x=id,y=stackdepth,
                     fill=`Colour Key`)) + geom_col(colour="black") +
labs(y="Depth")
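
In the code above the filter line is commented out, so every bar is split into four equal parts. A variant closer to the request, where a bar is divided only among the data types that are actually available, might look like the sketch below. It uses the three rows posted earlier in the thread; the names available and no_data, and the choice to draw boreholes without any data as white bars, are assumptions:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Illustrative rows in the layout posted earlier in the thread
boreholes <- data.frame(
  id            = c("XYZNAME", "ABCNAME", "FFGNAME"),
  depth         = c(66.4, 62, 30.5),
  Chemical      = c("no", "no", "no"),
  Mineralogical = c("no", "no", "yes"),
  Geotechnical  = c("yes", "no", "yes"),
  Petrophysical = c("no", "no", "no")
)

# Keep only the available data types and share the depth equally among them
available <- boreholes %>%
  pivot_longer(cols = Chemical:Petrophysical,
               names_to = "Type", values_to = "Presence") %>%
  filter(Presence == "yes") %>%
  group_by(id) %>%
  mutate(stackdepth = depth / n()) %>%
  ungroup()

# Boreholes with no data at all would otherwise vanish from the plot
no_data <- anti_join(boreholes, available, by = "id")

p <- ggplot() +
  geom_col(data = available, aes(x = id, y = stackdepth, fill = Type),
           colour = "black") +
  geom_col(data = no_data, aes(x = id, y = depth),
           fill = "white", colour = "black") +
  labs(x = NULL, y = "Depth (m)", fill = "Available data")
p
```

The total height of each bar still equals the borehole depth; only the number of coloured segments changes with the number of available data types.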

Nemlock · September 3, 2020, 3:16pm

Thank you very much!

Unfortunately, when I run the code, nothing happens (no plot is shown, and no file is created as with one of the earlier code examples)?

nirgrahamuk · September 3, 2020, 3:18pm

You might not have run dev.off() from the previous example to complete the previous png and free the graphics device.
Also, you can restart your R session (Ctrl+Shift+F10 on Windows).
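
A small base R sketch of that device handling. The temporary file and the plot itself are only placeholders:

```r
# Open a PNG device on a temporary file, draw, and close it again
out_file <- tempfile(fileext = ".png")
png(filename = out_file, width = 800, height = 400, units = "px")
plot(1:10, main = "Written to file, not to the Plots pane")
dev.off()  # the file is only completed once the device is closed

# If earlier png() calls were never closed, this shuts every open device
graphics.off()

file.exists(out_file)
```

While a png() device is still open, later plots are sent to that file rather than to the screen, which would match the "nothing is happening" symptom.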

Nemlock · September 3, 2020, 3:26pm

You are right.
It worked, thank you very much!

jrkrideau · September 3, 2020, 4:55pm

I just discovered there is no R parallel dotplot function on CRAN but the plots are usually hand-crafted. Duh.
Here is a nice one in ggplot2 from Andrew Gelman's blog Statistical Modeling, Causal Inference, and Social Science

It might be a bit more effort to implement but seems to offer a lot of flexibility and looks like it can handle the bore depth issue.

Blast it, reprex is hanging my system so I am going to have to just paste the code.

https://statmodeling.stat.columbia.edu/2020/08/30/an-example-of-a-parallel-dot-plot-a-great-way-to-display-many-properties-of-a-set-of-items/

library(ggplot2)
library(patchwork)

mtcars$car_name <- rownames(mtcars) # create new column for car names

# arrange data set in order of mpg

mtcars$car_name <- factor(mtcars$car_name, levels = mtcars$car_name[order(mtcars$mpg)])
mtcars$car_name

p1 <- ggplot(mtcars, aes(x=car_name, y=mpg)) +
  geom_point(stat='identity', fill="black", size=2) +
  scale_y_continuous(position = "right") +
  theme(plot.margin = unit(c(0.2,0,0.2,0), "cm"),
        axis.line.y = element_line("black", size= 1),
        axis.line.x = element_line("black", size= 1)) +
  labs(title="Car mpg") +
  coord_flip()

p2 <- ggplot(mtcars, aes(x=car_name, y=hp)) +
  geom_point(stat='identity', fill="black", size=2) +
  scale_y_continuous(position = "right") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank(),
        axis.line.y = element_line("black", size= 1),
        axis.line.x = element_line("black", size= 1),
        plot.margin = unit(c(0.2,0,0.2,0), "cm")) +
  labs(title="Car hp") +
  coord_flip()

p3 <- ggplot(mtcars, aes(x=car_name, y=wt)) +
  geom_point(stat='identity', fill="black", size=2) +
  scale_y_continuous(position = "right") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank(),
        axis.line.y = element_line("black", size= 1),
        axis.line.x = element_line("black", size= 1),
        plot.margin = unit(c(0.2,0,0.2,0), "cm")) +
  labs(title="Car weight") +
  coord_flip()

p1 + p2 + p3
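
Adapted to the borehole data, the same parallel layout might look like the sketch below: one panel with depth as dots and one panel with a filled or open symbol per data type. It uses the three rows posted earlier in the thread; the object names and the choice of symbols are assumptions, not something settled in the thread:

```r
library(ggplot2)
library(patchwork)

# Illustrative rows in the layout posted earlier in the thread
wells <- data.frame(
  Well_name     = c("XYZNAME", "ABCNAME", "FFGNAME"),
  Depth_m       = c(66.4, 62, 30.5),
  Chemical      = c("no", "no", "no"),
  Mineralogical = c("no", "no", "yes"),
  Geotechnical  = c("yes", "no", "yes"),
  Petrophysical = c("no", "no", "no"),
  stringsAsFactors = FALSE
)
data_cols <- c("Chemical", "Mineralogical", "Geotechnical", "Petrophysical")

# Order boreholes by depth so both panels share the same row order
wells$Well_name <- factor(wells$Well_name,
                          levels = wells$Well_name[order(wells$Depth_m)])

# Long format for the presence panel: one row per borehole and data type
presence <- data.frame(
  Well_name = rep(wells$Well_name, times = length(data_cols)),
  Type      = factor(rep(data_cols, each = nrow(wells)), levels = data_cols),
  Available = unlist(wells[data_cols], use.names = FALSE)
)

# Left panel: depth as a dot per borehole
p_depth <- ggplot(wells, aes(x = Depth_m, y = Well_name)) +
  geom_point(size = 2) +
  labs(title = "Depth (m)", x = NULL, y = NULL)

# Right panel: filled circle = data available, open circle = not available
p_presence <- ggplot(presence, aes(x = Type, y = Well_name, shape = Available)) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(yes = 16, no = 1)) +
  labs(title = "Data available", x = NULL, y = NULL) +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

combined <- p_depth + p_presence
combined
```

Because both panels use the same factor levels for Well_name, the rows line up, and the borehole names only need to be printed once on the left.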

Nemlock · September 4, 2020, 8:12am

Thank you very much for all your help!
