Best plot option for a mix of categorical/numerical data

Dear RStudio community,

I would like to ask a question that I have been dealing with for the past several days.
I still have not found a good way to display what I need clearly and reliably. Hence, I thought I would come to you, the experts, for help.

The problem:
I have borehole data (a TXT file with rows and columns) with 6 columns in total. The first column holds the name of the borehole, the second its depth, and the third to sixth state whether the borehole contains the data named in the column header (e.g. data1: yes/no, data2: yes/no, etc., up to data4).

My approach so far:
I did some research and found functions such as balloonplot (part of ggplot) to e.g. display the four categorical data types (data1 to data4) against the boreholes' names on the Y-axis, drawing bubbles whose size relates to depth (with a continuous colour scale). However, of course, this does not work, as it mixes categorical (yes/no) data with continuous data (depth in meters, for the colour code). Furthermore, if no data is available, how should the bubble be plotted if it is related to depth? But I still want to include the depth information in the same plot.
I thought through a number of other options but could not find a proper solution. Maybe I am just making my life too complicated with that...?

Therefore, I would like to ask whether you could give me a hint as to whether I am on the right track with the balloonplot option or whether I should consider a completely different type of plot. However, I would like to avoid standard bar plots.
The number of boreholes, and hence rows in my file, is 70 (which makes the X- or Y-axis quite long).

Thank you very much for your consideration in advance and have a nice day!

What information do you want to communicate about the boreholes?
As a reader, what would I want to understand after looking at your plot?

Hi,
Great description of the data layout, though some sample data would be handy. See FAQ: How to do a minimal reproducible example ( reprex ) for beginners for some suggestions on how to provide some. Perhaps just a sample of the data set in dput() format is all that is needed. It saves us mocking up a data set, and it is better to work with real data.

The real issue here is what substantive question are you asking, in general terms? What the question is determines the choice of display. It sounds like you want to display 5 pieces of information per borehole at a single point on a graph, one numeric and 4 categorical.

I think it can be done in theory but I am not sure how easily. Would it be feasible to display the data in multiple panels?

I would want the reader to know the name of the borehole (as it will be referenced in the text several times), if it contains data1, data2, data3, data4 or a combination of them and what kind of depth each borehole has by still showing the type of data (data1, data2, etc.).

Would a table layout, like this https://www.bing.com/images/search?view=detailV2&ccid=ljYu%2BlxL&id=1E1CFCAE12D63FCC5066B3D9ACECAEB425378888&thid=OIP.ljYu-lxLk8eeIVSZ2lFE_wHaE8&mediaurl=https%3A%2F%2Fwinvector.files.wordpress.com%2F2020%2F06%2F8feb8-plot3_2.png%3Fw%3D656&exph=400&expw=600&q=plot+multivariate+data++Cleveland+elements+of+graphing&simid=608030900958921019&ck=288EAA9D96AC65A4E842C376A3B13777&selectedIndex=4&FORM=IRPRST&ajaxhist=0

suitably jazzed up to deal with the categorical nature of the data do any good?

Thank you for your answer! I was just writing the other reply, so I am coming back to you now.
You are absolutely right.
Please find below a short excerpt of the data. The RStudio file is packed with lots of trial and error from my side (probably most of it is bull■■■■).

I was already thinking of using 0 and 1 instead of "yes" or "no" for the type of data (data1 to data4). With this, I could keep data purely numerical. However, I do not want R to interpret it as 0 or 100 % (or minimum/maximum) but simply: is it available (or not).

This is the TXT file I am working with (structure-wise):

Well_name Depth_m Chemical Mineralogical Geotechnical Petrophysical
XYZNAME 66.4 no no yes no
ABCNAME 62 no no no no
FFGNAME 30.5 no yes yes no

EDIT: interestingly enough, RStudio now always plots a i..COLUMNAME in front of my very first column header, whereas the "i" contains two dots on top!?
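
A note on the "i" with two dots: a likely cause, offered as an assumption since the file itself was not shared, is that the TXT file was saved as UTF-8 with a byte-order mark (BOM) and R reads those three bytes as part of the first column name. The sketch below simulates that situation with a temporary file (the temporary path and the object name boreholes_txt are made up for illustration) and reads it with fileEncoding = "UTF-8-BOM", which asks R to drop the mark.

```r
# Simulate a TXT file saved with a UTF-8 byte-order mark.
# Assumption: the real file starts with these three bytes.
path <- tempfile(fileext = ".txt")  # hypothetical stand-in for the real file
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
lines <- c(
  "Well_name Depth_m Chemical Mineralogical Geotechnical Petrophysical",
  "XYZNAME 66.4 no no yes no",
  "ABCNAME 62 no no no no",
  "FFGNAME 30.5 no yes yes no"
)
body <- charToRaw(paste0(paste(lines, collapse = "\n"), "\n"))
writeBin(c(bom, body), path)

# "UTF-8-BOM" removes the byte-order mark while reading, so the first
# header comes through without the extra characters in front.
boreholes_txt <- read.table(path, header = TRUE,
                            fileEncoding = "UTF-8-BOM",
                            stringsAsFactors = FALSE)
names(boreholes_txt)
```

If the prefix still shows up with your own file, the cause is probably something else, and saving the file without a BOM from the text editor would be the next thing to try.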

That might work.
But how do you replace the fraction values on the bottom? With depth values, for instance?
I guess a black spot is then drawn, when there exists data.

However, it would make the plot quite long having 70 boreholes in mind, I guess!?

With this method a user can read off a borehole's values by hovering over the bar.

library(tidyverse)

set.seed(42)
(boreholes <- tibble(
  id = c(letters,LETTERS),
  depth = rnorm(52,mean=100,20),
  Chemical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Mineralogical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Geotechnical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Petrophysical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE)
))

# custom colour mapping: each data type drives one rgb() channel (red, green, blue, alpha)
col_by_val <- function(x){
  case_when(x=="Yes" ~1,
            x=="No" ~ .5,
            TRUE ~ 0.1)
}

boreholes_with_desc <- mutate(boreholes,
                              description = paste0("ID: ",id,
                                     "\nDepth: ",round(depth,2),
                                     "\nChemical: ",Chemical,
                                     "\nMineralogical: ",Mineralogical,
                                     "\nGeotechnical: ",Geotechnical,
                                     "\nPetrophysical: ",Petrophysical),
                              key = rgb(col_by_val(Chemical),
                                        col_by_val(Mineralogical),
                                        col_by_val(Geotechnical),
                                        col_by_val(Petrophysical)
                              )) %>% arrange(key) %>% mutate(id=forcats::as_factor(id))

library(plotly)

plot_ly(data=boreholes_with_desc,
        type="bar",
        x=~id,
        y=~depth,
        hoverinfo="text",
        text=~description,
        marker = list(color=~key),
        stroke =I("black"))

I have never used such a graph. I just ran into it the other day, so I have no idea how difficult it is to use.

About the fractions on the x-axis, that is easy enough. You have binary data. Yes/No is the same as 1/0. We just convert the D1-4 columns to numeric. Presumably one could colour-code/symbol-code them as well.

My guess is that we are stuck with dots for the depth values, so the reader can see the relative size of the holes but not the actual value.

Re length. What about adding a variable for the length that allows you to facet the data? Split the data into the first 30 and last 30 for example and present 2 panels?
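
A minimal sketch of both ideas, using the three sample rows posted earlier in the thread: the yes/no columns become TRUE/FALSE (so they read as "available or not" rather than as 0 % or 100 %), and a helper column assigns each borehole to a panel for faceting. The names boreholes_panels, panel and panel_size are made up for illustration, and panel_size = 2 only suits this tiny sample; something like 35 would split 70 boreholes into two panels.

```r
library(dplyr)
library(ggplot2)

# Three rows copied from the sample in the thread; the real file has 70.
boreholes_txt <- read.table(header = TRUE, text = "
Well_name Depth_m Chemical Mineralogical Geotechnical Petrophysical
XYZNAME 66.4 no no yes no
ABCNAME 62 no no no no
FFGNAME 30.5 no yes yes no
")

panel_size <- 2  # assumption for this small sample

boreholes_panels <- boreholes_txt %>%
  # Convert the four yes/no columns to logical TRUE/FALSE.
  mutate(across(Chemical:Petrophysical, ~ .x == "yes")) %>%
  # Hypothetical helper column: the panel each borehole is drawn in.
  mutate(panel = paste("Panel", ceiling(row_number() / panel_size)))

# One dot per borehole, split into panels so the name axis stays short.
ggplot(boreholes_panels, aes(x = Depth_m, y = Well_name)) +
  geom_point() +
  facet_wrap(~ panel, scales = "free_y") +
  labs(x = "Depth (m)", y = NULL)
```

Using as.integer() on the logical columns gives the 1/0 version if numbers are preferred.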

Thank you very much for your help - it worked!

However, this is for a plot on paper and is not meant to be used solely on a computer, where you could hover over the plot with a mouse (sorry for pointing that out just now).
Is there a chance to take the result of your code and transform it into something for a paper plot?

If you want to plot it on paper, you will probably need 4xA4 sheets.

library(tidyverse)
library(ggrepel)
set.seed(42)
(boreholes <- tibble(
  id = c(letters,LETTERS),
  depth = rnorm(52,mean=100,20),
  Chemical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Mineralogical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Geotechnical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Petrophysical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE)
))

#custom color mapping
col_by_val <- function(x){
  case_when(x=="Yes" ~1,
            x=="No" ~ .5,
            TRUE ~ 0.1)
}

boreholes_with_desc <- mutate(boreholes,
                              description = paste0("ID: ",id,
                                                   "\nDepth: ",round(depth,2),
                                                   "\nChemical: ",Chemical,
                                                   "\nMineralogical: ",Mineralogical,
                                                   "\nGeotechnical: ",Geotechnical,
                                                   "\nPetrophysical: ",Petrophysical),
                              key = rgb(col_by_val(Chemical),
                                        col_by_val(Mineralogical),
                                        col_by_val(Geotechnical),
                                        col_by_val(Petrophysical)),
                              textkey= paste0("Chemical: ",Chemical,
                                              "\nMineralogical: ",Mineralogical,
                                              "\nGeotechnical: ",Geotechnical,
                                              "\nPetrophysical: ",Petrophysical)
) %>% arrange(key) %>% mutate(id=forcats::as_factor(id))


colkey <- select(boreholes_with_desc,
                 textkey,key) %>% distinct()
png(filename="massivebarplot.png",
    width = 4200,
    height = 700,
    units = "px")
ggplot(data=boreholes_with_desc,
       mapping = aes(x=id,y=depth,label=description,
                     fill=textkey)) + geom_col(colour="black") + scale_fill_manual(
                       # name each colour by its label so bar fills match the legend
                       values=setNames(colkey$key,colkey$textkey)
                     ) + 
  geom_label_repel(  size=4,force=20,direction="y")
dev.off()

You are probably right. ^^

However, would it be an option to group the mineralogy, geochemical etc. data according to a colour and then add a legend to the plot according to the borehole, and each bar is then colored with 4 quarters, each representing a portion of the data, substantiated by the colour code of each data type.

Hence, one bar would be e.g. 1 color if it contains only one parameter (e.g. mineralogy data) but if a borehole contains e.g. 3 data types, the bar would split up to 33.33% each, whereas each area would be coloured with the colour of the representative data type?

Not sure, whether that is not even too complicated... !?

library(tidyverse)
library(ggrepel)
set.seed(42)
(boreholes <- tibble(
  id = c(letters,LETTERS),
  depth = rnorm(52,mean=100,20),
  Chemical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Mineralogical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Geotechnical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE),
  Petrophysical = sample(c("Yes","No",NA_character_),size = 52,replace=TRUE)
))


(blong <- pivot_longer(data=boreholes,
                       cols=3:6,
                       names_to="Type",
                       values_to="Presence") %>%
    # filter(!is.na(Presence)) %>% 
    group_by(id) %>%
  mutate(count=n(),
         portion = 1/n(),
         stackdepth=portion*depth,
         `Colour Key` = paste0(Type,": ",Presence)))

ggplot(data=blong,
       mapping = aes(x=id,y=stackdepth,
                     fill=`Colour Key`)) + geom_col(colour="black") +
labs(y="Depth")

Thank you very much!

Unfortunately, when I run the code, nothing happens: no plot appears and no file is created, unlike with one of the earlier code examples.

You might not have run dev.off() from the previous example to complete the previous png and free the graphics device.
Alternatively, you can restart your R session: Ctrl+Shift+F10 on Windows.
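
A minimal sketch of how to check for and close a forgotten device without restarting. It uses pdf() on a temporary file only because that device is available on every installation; a png() device left open behaves the same way. The names open_before and open_after are made up for illustration.

```r
# Open a file device and "forget" to close it, as in the earlier example.
pdf(file = tempfile(fileext = ".pdf"))
plot(1:10)

# dev.list() returns NULL when no device is open; otherwise it lists the
# open devices. While a file device is open, new plots go into that file
# instead of the plot pane.
open_before <- dev.list()
print(open_before)

# graphics.off() closes every open device and finishes any pending files.
graphics.off()
open_after <- dev.list()
print(open_after)
```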

You are right.
It worked, thank you very much!

I just discovered there is no R parallel dotplot function on CRAN but the plots are usually hand-crafted. Duh.
Here is a nice one in ggplot2 from Andrew Gelman's blog Statistical Modeling, Causal Inference, and Social Science

It might be a bit more effort to implement but seems to offer a lot of flexibility and looks like it can handle the bore depth issue.

Blast it, reprex is hanging my system so I am going to have to just paste the code.

https://statmodeling.stat.columbia.edu/2020/08/30/an-example-of-a-parallel-dot-plot-a-great-way-to-display-many-properties-of-a-set-of-items/

library(ggplot2)
library(patchwork)

mtcars$car_name <- rownames(mtcars) # create new column for car names

# arrange data set in order of mpg

mtcars$car_name <- factor(mtcars$car_name, levels = mtcars$car_name[order(mtcars$mpg)])
mtcars$car_name

p1 <- ggplot(mtcars, aes(x=car_name, y=mpg)) +
  geom_point(stat='identity', fill="black", size=2) +
  scale_y_continuous(position = "right") +
  theme(plot.margin = unit(c(0.2,0,0.2,0), "cm"),
        axis.line.y = element_line("black", size= 1),
        axis.line.x = element_line("black", size= 1)) +
  labs(title="Car mpg") +
  coord_flip()

p2 <- ggplot(mtcars, aes(x=car_name, y=hp)) +
  geom_point(stat='identity', fill="black", size=2) +
  scale_y_continuous(position = "right") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank(),
        axis.line.y = element_line("black", size= 1),
        axis.line.x = element_line("black", size= 1),
        plot.margin = unit(c(0.2,0,0.2,0), "cm")) +
  labs(title="Car hp") +
  coord_flip()

p3 <- ggplot(mtcars, aes(x=car_name, y=wt)) +
  geom_point(stat='identity', fill="black", size=2) +
  scale_y_continuous(position = "right") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank(),
        axis.line.y = element_line("black", size= 1),
        axis.line.x = element_line("black", size= 1),
        plot.margin = unit(c(0.2,0,0.2,0), "cm")) +
  labs(title="Car weight") +
  coord_flip()

p1 + p2 + p3
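
A rough sketch of how the same idea might be adapted to the borehole data, using the three sample rows posted earlier in the thread: one panel shows depth as dots, and a second panel shows a filled or hollow dot per data type. This is only one possible layout under those assumptions, not a tested solution for the full 70-row file; the names availability, p_depth and p_avail are made up for illustration.

```r
library(ggplot2)
library(tidyr)
library(patchwork)

# Three rows copied from the sample in the thread; the real file has 70.
boreholes_txt <- read.table(header = TRUE, text = "
Well_name Depth_m Chemical Mineralogical Geotechnical Petrophysical
XYZNAME 66.4 no no yes no
ABCNAME 62 no no no no
FFGNAME 30.5 no yes yes no
")

# Order boreholes by depth so both panels share the same row order.
boreholes_txt$Well_name <- factor(
  boreholes_txt$Well_name,
  levels = boreholes_txt$Well_name[order(boreholes_txt$Depth_m)]
)

# Long format: one row per borehole and data type.
availability <- pivot_longer(boreholes_txt,
                             cols = Chemical:Petrophysical,
                             names_to = "Type",
                             values_to = "Available")

# Left panel: depth as a dot per borehole.
p_depth <- ggplot(boreholes_txt, aes(x = Depth_m, y = Well_name)) +
  geom_point(size = 2) +
  labs(x = "Depth (m)", y = NULL)

# Right panel: hollow dot = no data, filled dot = data available.
p_avail <- ggplot(availability,
                  aes(x = Type, y = Well_name, shape = Available)) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(no = 1, yes = 16)) +
  labs(x = NULL, y = NULL) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

p_depth + p_avail
```

Because the availability panel only needs two symbols, it should print well in black and white, and the depth values can be read from the left axis scale.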

Thank you very much for all your help!
