Shape characteristic in ggplot2

Naarayanan777 · January 27, 2020, 12:53pm

plotvsE=ggplot(Training_data_set_only,aes(E,t,shape=Category))+geom_point()+geom_smooth(method = "lm",se=F)

Can you please help me plot a linear regression for the whole data set, with each category represented by a different shape?

The groups shown in the legend each contain several chemicals. Can anyone tell me how to highlight a specific compound from that category column?

My data has a column named category, in which several chemicals are grouped under the names shown in the legend. How do I highlight a specific compound from one of the groups in the plot and also keep the linear regression fit?
Please provide a similar example so I can understand it.

TheWireMonkey · January 27, 2020, 1:40pm

Move the shape argument to the aes() of geom_point, i.e.:
ggplot(Training_data_set_only, aes(E, t)) +
  geom_point(aes(shape = Category)) +
  geom_smooth(method = "lm", se = FALSE)

Whatever is in the aes() of ggplot() is inherited by all subsequent layers, so a shape mapping placed there also splits geom_smooth() into one fit per category.
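To make the inheritance point concrete, here is a minimal sketch on the built-in iris data (chosen only for illustration, it is not the poster's data set). It builds the plot both ways and counts how many separate lines the smooth layer ends up fitting; the object names p_global, p_local, n_global and n_local are made up for this example.

```r
library(ggplot2)

# Shape mapped in ggplot(): geom_smooth() inherits it and fits per species
p_global <- ggplot(iris, aes(Sepal.Length, Petal.Length, shape = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Shape mapped only in geom_point(): the smooth layer sees a single group
p_local <- ggplot(iris, aes(Sepal.Length, Petal.Length)) +
  geom_point(aes(shape = Species)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# layer_data() returns the computed data of a layer; layer 2 is the smooth
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local  <- length(unique(layer_data(p_local, 2)$group))
```

Comparing n_global with n_local shows whether the fit was split by category or computed once for the whole data set.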

Matthias · January 27, 2020, 2:04pm

You can highlight a specific data point by adding another geom_point on top of the existing one that contains only the data you want to show. Here this is done by filtering the initial dataset on some criteria:

> data = mpg
> 
> ggplot(data, aes(x = hwy, y = cty)) +
>   geom_point(aes(shape = class),
>     width = 0.2, height = 0.2) + 
>   geom_smooth(method = "lm") + 
>   geom_point(data = filter(data, manufacturer == "audi" & model == "a4"),
>              colour = "red", size = 3)

Naarayanan777 · January 27, 2020, 2:58pm

Thanks for your help, sir, but when I run the example code you sent, it throws an error like this:
Warning: Ignoring unknown parameters: width, height
Error in filter(data, manufacturer == "audi" & model == "a4") :
object 'manufacturer' not found
In addition: Warning messages:
1: In data.matrix(data) : NAs introduced by coercion

The same happens with my own data.
Can you please suggest what the problem is and how I can correct it?

Matthias · January 27, 2020, 3:35pm

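Oh, try library(tidyverse) or library(dplyr) for the filter. Without dplyr loaded, filter() resolves to stats::filter(), which is what produces the "object 'manufacturer' not found" error.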
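As a small sketch of the filter() issue: calling the function with its package prefix avoids any doubt about which filter() is used, and base R's subset() is an alternative that needs no extra package. The names audi_a4 and audi_a4_base are made up for this example.

```r
library(ggplot2)  # only needed here for the mpg data set

# dplyr::filter() keeps the rows matching the condition
audi_a4 <- dplyr::filter(mpg, manufacturer == "audi" & model == "a4")

# Base R equivalent, no dplyr required
audi_a4_base <- subset(mpg, manufacturer == "audi" & model == "a4")

nrow(audi_a4)
nrow(audi_a4_base)
```

Either result can then be passed as data = ... to the extra geom_point() layer.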

And yes, it was my mistake: I used geom_jitter() first, replaced it with geom_point(), and did not test it again. That's why width and height aren't working!

library(tidyverse)  # loads ggplot2 and dplyr (for filter())
data = mpg

ggplot(data, aes(x = hwy, y = cty)) +
  geom_point(aes(shape = class)) + 
  geom_smooth(method = "lm") + 
  geom_point(data = filter(data, manufacturer == "audi" & model == "a4"),
             colour = "red", size = 3)

Naarayanan777 · January 28, 2020, 10:25am

Can anyone please help me generate a plot similar to the one above? It needs these components: 1) different shapes per category, 2) diagonal parallel lines 1 log unit above and below the 1:1 line, 3) a linear regression for the whole data set (my data set has several groups, and if I use lm I get a separate line for every group).

It would be helpful if you could provide an example that achieves this.

Matthias · January 28, 2020, 11:12am

As shown above:

When the shape mapping (which also defines the grouping) is put inside geom_point(), it doesn't affect the fitted line.
Diagonal lines can be introduced with geom_abline(), here the offset depends on your scale.
We cannot say more without having access to your data.

ggplot(data = iris, aes(x = Sepal.Length, 
                        # generate near 1:1 ratio
                        y = Petal.Length*(Sepal.Length-Petal.Length))) +
  geom_point(aes(shape = Species)) + 
  geom_smooth(method = "lm") +
  # diagonal line at 1:1
  geom_abline(slope = 1, intercept = 0) +
  # upper line
  geom_abline(slope = 1, intercept = 1,
              linetype = "dotted") +
  # lower line
  geom_abline(slope = 1, intercept = - 1,
              linetype = "dashed") +
  theme_bw()

Naarayanan777 · January 28, 2020, 11:23am

Thank you so much, sir, I will try it out. One query, though: as you said above, to highlight a specific compound I should overlay another geom_point.
I tried this on the iris data set using
.....The above code +geom_point(iris = filter(iris,Species=="sentosa"),colour= "red")
The result is that all points get overlaid with red points. Why does this happen and how can I correct it, Mr. Matthias?

Matthias · January 28, 2020, 11:39am

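Use: geom_point(data = filter(iris, Species == "setosa"), colour = "red")
The argument has to be named data; iris = is an unknown parameter and gets ignored, so the layer inherits the full data set and every point is drawn in red. The species is also spelled "setosa".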
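A minimal sketch of the corrected overlay, assuming the plain iris columns rather than the transformed y value used earlier, and using base subset() so that no dplyr is needed. The name n_highlighted is made up for this example.

```r
library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length, Petal.Length)) +
  geom_point(aes(shape = Species)) +
  # data = ... restricts this layer to the rows to be highlighted
  geom_point(data = subset(iris, Species == "setosa"), colour = "red")

# Number of points actually drawn by the red overlay layer (layer 2)
n_highlighted <- nrow(layer_data(p, 2))
```

If n_highlighted equals the size of the full data set, the data argument did not reach the layer.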

Naarayanan777 · January 28, 2020, 2:25pm

Sir, can you please explain the Petal.Length*(Sepal.Length - Petal.Length) term? I couldn't understand what is actually happening there.

Naarayanan777 · January 28, 2020, 2:49pm

Sir, please help me with this.
These data were generated from a multiple linear regression, and a data frame was created to store the predicted values.
For convenience, this is what my data look like; they are to be plotted as observed vs. predicted with a 1:1 line.

Hoping for your help

andresrcs · January 28, 2020, 4:02pm

If you need more specific help, please provide a proper REPRoducible EXample (reprex) illustrating your issue.

Naarayanan777 · January 28, 2020, 4:23pm

testdata1 = tibble::tribble(
            ~observed, ~predicted_values,                              ~category,                        ~list,
                  1.6,         1.7662534,             "Monoaromatichydrocarbon",                    "Benzene",
                 1.92,          2.106053,             "Monoaromatichydrocarbon",                    "Toluene",
                 2.51,         2.4269167,             "Monoaromatichydrocarbon",                   "p-Xylene",
                 2.35,         2.4461834,             "Monoaromatichydrocarbon",                   "o-Xylene",
                 2.19,         2.4504166,             "Monoaromatichydrocarbon",               "Ethylbenzene",
                 2.82,         2.7491294,             "Monoaromatichydrocarbon",     "1,3,5-trimethylbenzene",
                  2.8,          2.765026,             "Monoaromatichydrocarbon",     "1,2,3-trimethylbenzene",
                 3.12,         3.1288376,             "Monoaromatichydrocarbon", "1,2,4,5-tetramethylbenzene",
                 2.87,         2.7956433,             "Monoaromatichydrocarbon",            "n-propylbenzene",
                 3.39,         3.1341133,             "Monoaromatichydrocarbon",             "n-butylbenzene",
                 2.25,         2.2123077, "Monoaromatichalogenatedhydrocarbon",              "Chlorobenzene",
                 2.59,         2.6237682, "Monoaromatichalogenatedhydrocarbon",        "1,2-dichlorobenzene",
                 2.65,         2.6376784, "Monoaromatichalogenatedhydrocarbon",         "1,4-dichlorobezene",
                 2.47,         2.6665618, "Monoaromatichalogenatedhydrocarbon",        "1,3-dichlorobenzene",
                 3.22,         3.0837152, "Monoaromatichalogenatedhydrocarbon",     "1,2,3-trichlorobenzene",
                 3.25,         3.0698757, "Monoaromatichalogenatedhydrocarbon",     "1,2,4-trichlorobenzene",
                 3.84,         3.4756695, "Monoaromatichalogenatedhydrocarbon", "1,2,3,4-tetrachlorobenzene",
                 3.93,         3.4918422, "Monoaromatichalogenatedhydrocarbon", "1,2,4,5-tetrachlorobenzene"
            )
head(testdata1)
#> # A tibble: 6 x 4
#>   observed predicted_values category                list                  
#>      <dbl>            <dbl> <chr>                   <chr>                 
#> 1     1.6              1.77 Monoaromatichydrocarbon Benzene               
#> 2     1.92             2.11 Monoaromatichydrocarbon Toluene               
#> 3     2.51             2.43 Monoaromatichydrocarbon p-Xylene              
#> 4     2.35             2.45 Monoaromatichydrocarbon o-Xylene              
#> 5     2.19             2.45 Monoaromatichydrocarbon Ethylbenzene          
#> 6     2.82             2.75 Monoaromatichydrocarbon 1,3,5-trimethylbenzene

The data are to be plotted with a 1:1 diagonal line and a linear regression, like the graph above from Mr. Matthias. Please help me with this, sir.

andresrcs · January 28, 2020, 4:36pm

This can work as a starting point

library(tidyverse)

testdata1 = tibble::tribble(
    ~observed, ~predicted_values,                              ~category,                        ~list,
    1.6,         1.7662534,             "Monoaromatichydrocarbon",                    "Benzene",
    1.92,          2.106053,             "Monoaromatichydrocarbon",                    "Toluene",
    2.51,         2.4269167,             "Monoaromatichydrocarbon",                   "p-Xylene",
    2.35,         2.4461834,             "Monoaromatichydrocarbon",                   "o-Xylene",
    2.19,         2.4504166,             "Monoaromatichydrocarbon",               "Ethylbenzene",
    2.82,         2.7491294,             "Monoaromatichydrocarbon",     "1,3,5-trimethylbenzene",
    2.8,          2.765026,             "Monoaromatichydrocarbon",     "1,2,3-trimethylbenzene",
    3.12,         3.1288376,             "Monoaromatichydrocarbon", "1,2,4,5-tetramethylbenzene",
    2.87,         2.7956433,             "Monoaromatichydrocarbon",            "n-propylbenzene",
    3.39,         3.1341133,             "Monoaromatichydrocarbon",             "n-butylbenzene",
    2.25,         2.2123077, "Monoaromatichalogenatedhydrocarbon",              "Chlorobenzene",
    2.59,         2.6237682, "Monoaromatichalogenatedhydrocarbon",        "1,2-dichlorobenzene",
    2.65,         2.6376784, "Monoaromatichalogenatedhydrocarbon",         "1,4-dichlorobezene",
    2.47,         2.6665618, "Monoaromatichalogenatedhydrocarbon",        "1,3-dichlorobenzene",
    3.22,         3.0837152, "Monoaromatichalogenatedhydrocarbon",     "1,2,3-trichlorobenzene",
    3.25,         3.0698757, "Monoaromatichalogenatedhydrocarbon",     "1,2,4-trichlorobenzene",
    3.84,         3.4756695, "Monoaromatichalogenatedhydrocarbon", "1,2,3,4-tetrachlorobenzene",
    3.93,         3.4918422, "Monoaromatichalogenatedhydrocarbon", "1,2,4,5-tetrachlorobenzene"
)

testdata1 %>% 
    ggplot(aes(x = observed, y = predicted_values)) +
    geom_point(aes(shape = category)) +
    geom_abline(slope = 1, intercept = 0) +
    geom_abline(slope = 1, intercept = 1,
                linetype = "dotted") +
    geom_abline(slope = 1, intercept = - 1,
                linetype = "dashed") +
    scale_x_continuous(limits = c(0, 4)) +
    scale_y_continuous(limits = c(0, 4)) +
    geom_smooth(method = "lm", color = "black") +
    coord_equal()
#> `geom_smooth()` using formula 'y ~ x'

Naarayanan777 · January 28, 2020, 4:52pm

ggplot(data = predict_new4,aes(x = observed, y = predicted_values)) +
geom_point(aes(shape = category)) + geom_smooth(method = "lm")+
geom_abline(slope = 1, intercept = 0) +
geom_abline(slope = 1, intercept = 1,linetype = "dotted") +
geom_abline(slope = 1, intercept = - 1,linetype = "dashed") +
scale_x_continuous(limits = c(0, 4)) +
scale_y_continuous(limits = c(0, 4)) +
geom_smooth(method = "lm", color = "black") +
coord_equal()

When I give it the complete data set, these warnings appear:

Warning messages:
1: Removed 23 rows containing non-finite values (stat_smooth).
2: Removed 23 rows containing non-finite values (stat_smooth).
3: The shape palette can deal with a maximum of 6 discrete values because more
than 6 becomes difficult to discriminate; you have 7. Consider specifying
shapes manually if you must have them.
4: Removed 42 rows containing missing values (geom_point).

What should I do to eliminate these, sir? I can understand the one about the 7th variable, but why did it remove data?

andresrcs · January 28, 2020, 4:56pm

Those are just warnings you get because you don't have an equal number of observations for all the categories and because you have too many categories to be individually represented by point shapes.

The only solution for this would be for you to rethink the way you are representing the data.

Naarayanan777 · January 28, 2020, 4:59pm

But I have the same number of values for both predicted and observed. How can I give a specific shape to the 7th variable, which it did not consider?
Now I get it: the rows were removed because the 7th variable is not given a symbol, so the respective data are dropped. So how can I give a specific symbol to the 7th variable, sir?

Matthias · January 28, 2020, 5:04pm

There are 42 data points that do not have observations, either in one or in both of the conditions.
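A quick way to see where dropped rows come from, sketched on a small made-up data frame (the values and the helper in_range() are hypothetical, not the poster's data): rows are removed when a value is NA or falls outside the limits given to scale_x_continuous() / scale_y_continuous(), because out-of-range values are turned into NA by default.

```r
# Hypothetical example data: two rows with NA, one row outside 0..4
demo <- data.frame(
  observed         = c(1.6, NA, 2.5, 4.8),
  predicted_values = c(1.8, 2.1, NA, 3.9)
)

# Rows with a missing value in either column
n_missing <- sum(!complete.cases(demo[, c("observed", "predicted_values")]))

# Hypothetical helper: TRUE when a value is present and inside the limits
in_range <- function(v, lower = 0, upper = 4) !is.na(v) & v >= lower & v <= upper

# Rows that cannot be drawn with limits = c(0, 4) on both axes
n_dropped <- sum(!(in_range(demo$observed) & in_range(demo$predicted_values)))
```

Running the same two counts on the real data frame should show whether the removed rows are explained by missing values, by the axis limits, or by something else.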

andresrcs · January 28, 2020, 5:05pm

Yes but you are grouping by category and you don't have an equal number of observations within each category.

Naarayanan777 · January 28, 2020, 5:13pm

No sir, I get it now: the data belonging to the group with no assigned shape have been removed, because the number of values removed equals the size of that group. So they should be plotted once that variable is given a specific shape.

The polar chemical group is not assigned a shape and is thus removed.

So how can I give it a specific symbol so that it is considered?
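One possible way, following the warning's own suggestion to specify shapes manually, is scale_shape_manual(). This is only a sketch on made-up data with seven categories (the data frame demo, the group names and the particular shape codes are assumptions for illustration); with the real data, the values vector needs one entry per category.

```r
library(ggplot2)

# Hypothetical stand-in data with seven categories
set.seed(1)
demo <- data.frame(
  observed = runif(70, 1, 4),
  category = rep(paste("group", 1:7), each = 10)
)
demo$predicted_values <- demo$observed + rnorm(70, sd = 0.2)

p <- ggplot(demo, aes(observed, predicted_values)) +
  geom_point(aes(shape = category)) +
  # one built-in shape code per category, seven in total
  scale_shape_manual(values = c(16, 17, 15, 3, 7, 8, 4)) +
  geom_smooth(method = "lm", formula = y ~ x, colour = "black") +
  geom_abline(slope = 1, intercept = 0)

# Shapes actually assigned in the point layer (layer 1)
shapes_used <- unique(layer_data(p, 1)$shape)
```

With every category given a shape, no points should be dropped for lack of a symbol; any rows still removed would then come from missing or out-of-range values.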