Zooming out of a linear regression plot

Stephan95 · August 26, 2019, 4:43pm

Dear community,

I hope you can help me out. I am doing a linear regression in RStudio for the first time and I wanted to get a nice plot of the data with a regression line.
The data I use consist of 21,000 observations. I do not know if this is the reason, but you cannot see any trend in the plot. Is the number of dots too low, or is the scale wrong?

Does anybody have an idea on how to solve this problem?

Thank you very much in advance.

This is the code I used:

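# Title: "Relationship between job and life satisfaction"; x = job satisfaction, y = life satisfaction
plot(ESS_Datensatz_bereinigt$stfjb, ESS_Datensatz_bereinigt$happy,
     main = "Zusammenhang zwischen Arbeits- und Lebenszufriedenheit",
     xlab = "Arbeitszufriedenheit", ylab = "Lebenszufriedenheit")
# abline() is a separate call after plot(); the formula is y ~ x to match the axes
abline(lm(happy ~ stfjb, data = ESS_Datensatz_bereinigt), col = "red")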

I also tried it using the ggplot2 package, but that didn't help.

andresrcs · August 26, 2019, 5:13pm

Hi

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example (reprex) for beginners

Matthias · August 26, 2019, 6:26pm

Or can you find a way to share the data you used? You probably cannot mimic this with a built-in data set.
Wouldn't you agree that it's a bit strange to have data points spread evenly over the whole range?
It seems to me that you simply have many observations per plotted point, because people chose values on a scale from 1 to 10. So the plot actually hides some of the data. You could add some jitter to show the overlap, or consider whether another representation might be better, e.g. a boxplot or a violin plot, or a balloon plot, where the size of each point shows the count at that location.
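
To make the overplotting concrete, here is a minimal sketch on simulated data (not the ESS data; the variable names and the simulated relationship are assumptions for illustration only). It counts how many observations share each x-y position when both variables are whole numbers from 0 to 10:

```r
# Simulated stand-in data (NOT the ESS data): two integer ratings from 0 to 10.
set.seed(1)
n <- 21000
x <- sample(0:10, n, replace = TRUE)
y <- pmin(10, pmax(0, round(0.4 * x + rnorm(n, mean = 4, sd = 2))))

# With 11 possible values per axis, a plain scatter plot can show
# at most 11 x 11 = 121 distinct positions, however many rows there are.
pair_counts <- table(x, y)
n_positions <- sum(pair_counts > 0)

n_positions        # number of occupied grid positions
max(pair_counts)   # observations stacked behind the busiest single dot
```

Every dot in the plain scatter plot therefore stands for anything from one to a few hundred observations, which is why no trend is visible.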

Stephan95 · August 26, 2019, 6:48pm

Hi Matthias, hi Andres,

thank you very much for your help!

The data I use are provided by the European Social Survey. I uploaded my dataset, I hope it works: https://www.file-upload.net/download-13702472/ESS1-8e01.sav.html

You are right: the two variables were measured on an 11-point Likert scale.
You'll probably need the "haven" package, since this is a .sav-file.

I want to do a linear regression with "happy" (general happiness) as my dependent variable and "stfjb" (job satisfaction) as my independent variable. The idea of the work is to test the hypothesis that the more satisfied you are with your job, the more satisfied you are with life in general.

The regression itself works, with p < 0.001, but I just can't plot a graph showing the regression...

Matthias · August 26, 2019, 8:27pm

Option 1: Add some jitter and transparency to have the single counts identifiable:

ggplot(ESS_bereinigt, aes(x = as.numeric(stfjb),
                          y = as.numeric(happy))) +
  geom_jitter(width = 0.4, height = 0.4, alpha = 0.05) +
  theme_classic() +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  scale_y_continuous(breaks = seq(0, 10, 1)) +
  geom_smooth(method = "lm", na.rm = TRUE)

Matthias · August 26, 2019, 8:32pm

Option 2: Violin plots. I purposely increased the width to show the crowding in the 8-8 area. Also, scale = "count" is important so that areas with more observations appear wider.

ggplot(ESS_bereinigt, aes(x=as.numeric(stfjb), 
                          y = as.numeric(happy))) + 
  geom_violin(aes(group = as.numeric(stfjb)),
              scale = "count", width = 1.75,
              fill = "grey50", alpha = 0.5) + 
  theme_classic() +
  scale_x_continuous(breaks = seq(0,10,1)) +
  scale_y_continuous(breaks = seq(0,10,1)) +
  geom_smooth(method = "lm")

Boxplots actually don't look good here, because the median always falls exactly on one of the scale values.

Option 3: Count the number of observations per X-Y-pair, then define the size of the points depending on the counts.

library(dplyr)    # group_by(), mutate(), %>%
library(ggplot2)

ESS_bereinigt %>%
  group_by(cntry, stfjb, happy) %>%
  mutate(happy_count = n()) %>%   # observations per country and x-y pair
  ggplot(aes(x = as.numeric(stfjb),
             y = as.numeric(happy))) +
  geom_point(aes(size = happy_count)) +
  theme_classic() + scale_size_continuous(range = c(1, 11)) +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  scale_y_continuous(breaks = seq(0, 10, 1)) +
  geom_smooth(method = "lm", na.rm = TRUE, size = 1.5)
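
As a possible alternative to counting by hand, ggplot2 also provides geom_count(), which counts the observations at each x-y position and maps that count to point size. The sketch below uses simulated data (not the ESS data; the data frame d and its relationship between the two columns are assumptions for illustration), so treat it as a starting point rather than a drop-in replacement:

```r
library(ggplot2)

# Simulated stand-in data (NOT the ESS data): integer ratings from 0 to 10.
set.seed(1)
d <- data.frame(stfjb = sample(0:10, 2000, replace = TRUE))
d$happy <- pmin(10, pmax(0, round(0.4 * d$stfjb + rnorm(2000, mean = 4, sd = 2))))

p <- ggplot(d, aes(x = stfjb, y = happy)) +
  geom_count(alpha = 0.6) +                      # one dot per x-y pair, sized by its count
  scale_size_area(max_size = 11) +               # dot area proportional to the count
  scale_x_continuous(breaks = 0:10) +
  scale_y_continuous(breaks = 0:10) +
  geom_smooth(method = "lm", formula = y ~ x) +  # fit still uses every row
  theme_classic()
p
```

Because geom_count() draws a single dot per position, there are no stacked points of different sizes, while geom_smooth() still fits the line on all rows.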

PS: Maybe it's the other way around: people who are happier in their life can also enjoy their work more!?

Stephan95 · August 27, 2019, 12:06pm

Wow, thank you very much, Matthias!

I like option 3 a lot! Is there a way to change the colours of the dots?

For example counts < 50 = red, 50 = green, 100 = blue and so on? It would make it much easier for the reader to see where the crowded areas are.

Matthias · August 27, 2019, 8:04pm

Isn't this illustrative enough? Actually, my code contained a mistake: it counted the occurrences per country, so there were still many points with different sizes at each position. You still need all the points for the linear fit, but at least the points at one position should all have the same count.

Just remove cntry from group_by(), then add colour = happy_count to geom_point() and define the colours with scale_colour_gradientn(). Here values = c(0.25, 0.75, 1) defines the position of each colour on a scale from 0 to 1: 0.25 corresponds to roughly 500, 0.75 to roughly 1500, and 1 to the highest value (2164, actually). Therefore 2000 is not completely blue. Sorry, I am not very proficient with gradient colours...

ESS_bereinigt %>%
  group_by(stfjb, happy) %>%
  mutate(happy_count = n()) %>%
  ggplot(aes(x = as.numeric(stfjb),
             y = as.numeric(happy))) +
  geom_point(aes(size = happy_count,
                 colour = happy_count)) +
  theme_classic() + scale_size_continuous(range = c(1, 11)) +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  scale_y_continuous(breaks = seq(0, 10, 1)) +
  scale_colour_gradientn(colours = c("grey50", "green", "blue4"),
                         guide = "legend",
                         values = c(0.25, 0.75, 1)) +
  geom_smooth(method = "lm", na.rm = TRUE, size = 1.5)
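
If you would rather place the colours at specific counts than at fractions, one option is to let scales::rescale() convert the counts into the 0 to 1 positions that values expects. This is only a sketch: the lower limit of 1 and the chosen stops and colours are assumptions, and only the maximum of 2164 comes from the data above.

```r
library(ggplot2)
library(scales)

# Assumed count range: 1 is a guess for the minimum, 2164 is the maximum
# mentioned above. With real data, use range(happy_count) instead.
count_range <- c(1, 2164)

# Counts at which each colour should be fully reached (illustrative choice).
stops <- c(1, 500, 1500, 2164)

# rescale() maps the stops onto [0, 1] relative to the full count range.
stop_positions <- rescale(stops, to = c(0, 1), from = count_range)

colour_scale <- scale_colour_gradientn(
  colours = c("grey80", "grey50", "green", "blue4"),
  values  = stop_positions,
  limits  = count_range,
  guide   = "legend"
)
```

Adding colour_scale to the plot in place of the scale_colour_gradientn() call above should then pin grey50 to a count of 500 and green to 1500, with blue4 reached only at the maximum.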

Stephan95 · August 28, 2019, 9:07am

Thank you very much, Matthias

mara · August 28, 2019, 10:12am

If your question's been answered (even if by you), would you mind choosing a solution? (See FAQ below for how).

Having questions checked as resolved makes it a bit easier to navigate the site visually and see which threads still need help.

Thanks

kjhnav · September 1, 2019, 11:46am

https://stat4everyone.shinyapps.io/lowess/

I hope this will be helpful to you.
You can understand it quickly by looking at the example data.

system · September 8, 2019, 11:46am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.