datetime x axis labelling behaves unpredictably?

I've got a problem with a report I'm creating, where my charts have a datetime x axis and I am getting unwanted labels on the x axis (my data ends in Dec 2023 but I'm getting a Jan 2024 label created, which I believe is unwanted behaviour and confusing to my readers). I'm using date_breaks = "3 months" in my report.

I've created a reprex (below) to illustrate the issue using synthetic data. In the first three examples, data is constrained within a single calendar year, and the x axis is labelled in what I would consider a reasonable and accurate way.

In the second half of the reprex, everything is set up the same, except there are now 24 months of data. The x axis labelling now behaves unpredictably, in my opinion. (It starts going Feb-May-Aug-Nov instead of Mar-Jun-Sep-Dec, and then if I reduce the period for the date breaks, I start getting labels for months that are outside the date range of my data. This is the issue I am having in my actual report.)

The last example shows my attempt to use the limits argument to constrain the x axis labels. This is "successful" in that the unwanted labels disappear, but it then loses months of data in a way that I do not understand.

library(ggplot2)
library(lubridate, warn.conflicts = FALSE)

t <- as.POSIXct("2021-01-01")

e1 <- as.POSIXct("2021-12-31 23:59:59")
n1 <- as.numeric(lubridate::as.duration(e1 - t))
s1 <- sample(seq.int(n1), 3000L)

d1 <- tibble::tibble(dttm = t + s1) |>
  dplyr::mutate(mon = lubridate::floor_date(dttm, "month")) |>
  dplyr::summarise(total = dplyr::n(), .by = "mon")


d1 |>
  ggplot2::ggplot(aes(x = mon, y = total)) +
  ggplot2::geom_line() +
  ggplot2::scale_x_datetime(
    date_breaks = "3 months",
    date_labels = "%b %y"
  ) +
  labs(
    title = "x axis labelled Mar-Jun-Sep-Dec as expected",
    x = NULL,
    y = NULL
  )

d1 |>
  ggplot2::ggplot(aes(x = mon, y = total)) +
  ggplot2::geom_line() +
  ggplot2::scale_x_datetime(
    date_breaks = "2 months",
    date_labels = "%b %y"
  ) +
  labs(title = "x axis labelled as expected", x = NULL, y = NULL) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5))

d1 |>
  ggplot2::ggplot(aes(x = mon, y = total)) +
  ggplot2::geom_line() +
  ggplot2::scale_x_datetime(
    date_breaks = "1 month",
    date_labels = "%b %y"
  ) +
  labs(title = "x axis labelled as expected", x = NULL, y = NULL) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5))

e2 <- as.POSIXct("2022-12-31 23:59:59")
n2 <- as.numeric(lubridate::as.duration(e2 - t))
s2 <- sample(seq.int(n2), 6000L)

d2 <- tibble::tibble(dttm = t + s2) |>
  dplyr::mutate(mon = lubridate::floor_date(dttm, "month")) |>
  dplyr::summarise(total = dplyr::n(), .by = "mon")


d2 |>
  ggplot2::ggplot(aes(x = mon, y = total)) +
  ggplot2::geom_line() +
  ggplot2::scale_x_datetime(
    date_breaks = "3 months",
    date_labels = "%b %y"
  ) +
  labs(title = "x axis labelled with Feb-May-Aug-Nov", x = NULL, y = NULL)

d2 |>
  ggplot2::ggplot(aes(x = mon, y = total)) +
  ggplot2::geom_line() +
  ggplot2::scale_x_datetime(
    date_breaks = "2 months",
    date_labels = "%b %y"
  ) +
  labs(
    title = "x axis includes Jan 23 (outside the dataset)",
    x = NULL,
    y = NULL
  ) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5))

d2 |>
  ggplot2::ggplot(aes(x = mon, y = total)) +
  ggplot2::geom_line() +
  ggplot2::scale_x_datetime(
    date_breaks = "1 month",
    date_labels = "%b %y"
  ) +
  labs(
    title = "x axis includes Dec 20 and Jan 23 (outside the dataset)",
    x = NULL,
    y = NULL
  ) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5))

d2 |>
  ggplot2::ggplot(aes(x = mon, y = total)) +
  ggplot2::geom_line() +
  ggplot2::scale_x_datetime(
    date_breaks = "1 month",
    date_labels = "%b %y",
    limits = \(x) c(x[[1]] + month(1), x[[2]] - month(1))
  ) +
  labs(
    title = "correcting the limits works but loses 4 months of data",
    x = NULL,
    y = NULL
  ) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5))
#> Warning: Removed 4 rows containing missing values (`geom_line()`).

Created on 2024-01-30 with reprex v2.1.0

Without actually running the code I think you are just tipping over into a new period. What happens if you shift to a two month interval?

Thanks for your response. The sample data creation in my reprex is a little convoluted, I admit, but it ought to only return dttms that are within the specified period. (n1 and n2 are set up to limit the sampled dttms to the desired period.) It has worked OK in my testing so far, so there shouldn't be any creep over into the January of the following year.
And in fact you can see from the charts that there's no data point associated with the January. It's just that in some cases that unwanted rightmost major axis line gets rendered and labelled, and sometimes it doesn't.
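The claim that the sampling cannot creep into the following January can be checked directly. The sketch below rebuilds the 24-month sample in base R only; the names t0 and dttm, and the fixed UTC time zone, are assumptions for illustration and differ slightly from the reprex, which uses the local time zone.

```r
# Rebuild the 24-month sample from the reprex, pinned to UTC (an assumption).
t0 <- as.POSIXct("2021-01-01", tz = "UTC")
e2 <- as.POSIXct("2022-12-31 23:59:59", tz = "UTC")

# Number of whole seconds between the start and the end of the period.
n2 <- as.numeric(difftime(e2, t0, units = "secs"))

# Offsets are drawn from 1..n2, so every timestamp lies in (t0, e2].
s2 <- sample(seq.int(n2), 6000L)
dttm <- t0 + s2

# Floor each timestamp to the first of its month (base R stand-in for floor_date()).
mon <- as.POSIXct(format(dttm, "%Y-%m-01"), tz = "UTC")

range(dttm)  # both ends should fall inside 2021-2022
range(mon)   # the latest month start should be no later than 2022-12-01
```

If both ranges stay inside 2021 to 2022, the extra axis label cannot be coming from a stray data point.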

I have run into something a bit puzzling but I wonder if it might mean anything.

I am not a great fan of dplyr and decided to load the two tibbles into data.table. For some unknown reason d2 refused to convert, so in frustration I wrote it to disk using the data.table function fwrite() and read it back in with fread().

To my shock the original d1 and the new data.table did not match. I am no expert but it looks to me that the POSIXct value which is being displayed in a YYYY-MM-DD format actually is YYYY-MM-DD HH:MM:SS.

It may be that some dates are somehow just far enough past a date boundary to trigger the problem. On the other hand, I may just have noticed a strange bit of useless trivia.

I have done a bit of editing of your code and renamed d1 & d2 to dat1 & dat2 simply because I have a few simple data exploration routines that use my variable names.

Anyway here is what I have

suppressMessages(library(data.table)); suppressMessages(library(tidyverse))

t <- as.POSIXct("2021-01-01")

e1 <- as.POSIXct("2021-12-31 23:59:59")
n1 <- as.numeric(lubridate::as.duration(e1 - t))
s1 <- sample(seq.int(n1), 3000L)

dat1 <- tibble(dttm = t + s1) |>
   mutate(mon = lubridate::floor_date(dttm, "month")) |>
   summarise(total = dplyr::n(), .by = "mon")

e2 <- as.POSIXct("2022-12-31 23:59:59")
n2 <- as.numeric(lubridate::as.duration(e2 - t))
s2 <- sample(seq.int(n2), 6000L)

dat2 <- tibble::tibble(dttm = t + s2) |>
  mutate(mon = lubridate::floor_date(dttm, "month")) |>
  summarise(total = dplyr::n(), .by = "mon")

fwrite(dat1, "dat1.csv")
fwrite(dat2, "dat2.csv")

write.csv(dat1, "raw1.csv")
write.csv(dat2, "raw2.csv")

DT1 <- fread("dat1.csv")
DT2 <- fread("dat2.csv")

raw1 <- read.csv("raw1.csv")
raw2 <- read.csv("raw2.csv")

# DT1 has hours, raw1 does not
DT1
raw1

Yeah, as.POSIXct("2021-01-01") is just a shorthand for as.POSIXct("2021-01-01 00:00:00") - if no hms is supplied then it defaults to midnight.
I'm not sure if that explains what you're seeing.

Anyway, I think my question in this thread is not really about what's in the data per se. It's a question about ggplot2 and its labelling choices.

Yes, but if you look carefully at my examples, I don't think your dates are defaulting to midnight. If data.table is accurate, your dates are in the early morning, either 04:00 or 05:00.

ggplot2 labelling is something only known to Hadley :smiley:

Ah interesting.

I've amended your example code a bit, as I would prefer to use readr than base for read/write.csv.

I realised something I should have remembered in advance, and it has caught me a bit here: as.POSIXct uses the local timezone by default; lubridate::as_datetime defaults to UTC. So some of my dttms were coming out as 11pm.

suppressMessages(library(data.table)); suppressMessages(library(tidyverse))

t <- lubridate::as_datetime("2021-01-01")

e1 <- lubridate::as_datetime("2021-12-31 23:59:59")
n1 <- as.numeric(lubridate::as.duration(e1 - t))
s1 <- sample(seq.int(n1), 300L)

dat1 <- tibble(dttm = t + s1) |>
  mutate(mon = lubridate::floor_date(dttm, "month")) |>
  summarise(total = dplyr::n(), .by = "mon")

e2 <- lubridate::as_datetime("2022-12-31 23:59:59")
n2 <- as.numeric(lubridate::as.duration(e2 - t))
s2 <- sample(seq.int(n2), 600L)

dat2 <- tibble::tibble(dttm = t + s2) |>
  mutate(mon = lubridate::floor_date(dttm, "month")) |>
  summarise(total = dplyr::n(), .by = "mon")

fwrite(dat1, "dat1.csv")
fwrite(dat2, "dat2.csv")

# write.csv(dat1, "raw1.csv")
readr::write_csv(dat1, "raw1.csv")
# write.csv(dat2, "raw2.csv")
readr::write_csv(dat2, "raw2.csv")

DT1 <- fread("dat1.csv")
DT2 <- fread("dat2.csv")

raw1 <- readr::read_csv("raw1.csv")
raw2 <- readr::read_csv("raw2.csv")

# raw1 has hours, DT1 does not!
DT1
raw1

I am assuming that this is a quirk of data.table, which I don't use. Still interesting, though.
raw1 comes out as I would expect.
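The local-time-versus-UTC point above can be seen without any packages. This is a minimal sketch in base R; the zone "America/New_York" is purely an assumption, chosen because its UTC offsets (4 or 5 hours) would be consistent with the 04:00 and 05:00 values mentioned earlier. A POSIXct value is stored as seconds since 1970-01-01 UTC, and the printed form depends on the time zone used for display. As far as I know, fwrite() writes date-times in UTC by default, which would explain where the "extra" hours come from.

```r
# Midnight local time in an assumed zone (not necessarily the poster's zone).
x <- as.POSIXct("2021-01-01", tz = "America/New_York")

# Printed in its own zone, the time part is dropped because it is exactly midnight.
format(x)

# The same instant shown in UTC is five hours later on the clock.
format(x, tz = "UTC", usetz = TRUE)

# Underlying storage: seconds since the epoch, independent of display zone.
as.numeric(x)
```

So the tibble does not hold more information than it shows; the same instant is simply being printed in two different zones.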

Ah, yes. I tend to forget that, as time zones are usually not that important to me, so UTC is generally okay.

Data.table has its quirks to put it mildly but my question is where is it getting those hours from? It seems to be pulling more information from your tibble than I thought was there.

It may just be my imagination but I think those extra hours are taking you into a new three-month period Oct-Dec and ggplot is extending the layout into January as part of the automatic layout. Sounds a bit stupid but that's my best guess at the moment.

At the moment I have no idea how to check it.
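One rough way to test a guess like this without touching ggplot2 is to simulate the axis in base R. The sketch below rests on two assumptions about ggplot2 and scales that are not confirmed in this thread: (a) the axis range is the data range padded by 5% on each side, and (b) the break sequence starts at the first of the month containing the padded lower limit. The helper axis_breaks() is hypothetical and exists only for this illustration.

```r
# Month-start timestamps like the `mon` column in the reprex (UTC assumed).
mon_12 <- seq(as.POSIXct("2021-01-01", tz = "UTC"), by = "month", length.out = 12)
mon_24 <- seq(as.POSIXct("2021-01-01", tz = "UTC"), by = "month", length.out = 24)

# Hypothetical helper: breaks that would fall inside the padded axis range,
# under assumptions (a) and (b) described above.
axis_breaks <- function(mon, by, mult = 0.05) {
  r <- range(mon)
  pad <- as.numeric(difftime(r[2], r[1], units = "secs")) * mult
  lo <- r[1] - pad                       # padded lower limit
  hi <- r[2] + pad                       # padded upper limit
  start <- as.POSIXct(format(lo, "%Y-%m-01"), tz = "UTC")
  brks <- seq(start, hi, by = by)        # seq() already drops breaks beyond hi
  brks[brks >= lo]                       # drop breaks before the lower limit
}

format(axis_breaks(mon_12, "3 months"), "%Y-%m")
format(axis_breaks(mon_24, "3 months"), "%Y-%m")
length(axis_breaks(mon_12, "1 month"))
length(axis_breaks(mon_24, "1 month"))
```

Under these assumptions the padding is about 17 days for 12 months of data and about 35 days for 24 months. The larger padding is wide enough to pull the first of the neighbouring months inside the axis range, and it also moves the starting month of the 3-month sequence. That would point at the axis padding rather than at stray hours in the data.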

What specific behavior are you finding unpredictable? Are there labels missing unexpectedly, are they appearing in the wrong order, or is there something else happening?

  • When data is within a 12-month period, the x-axis labels are Mar, Jun, Sep, Dec. When the data is supplied over a 24-month period, the labelled months are now Feb, May, Aug, Nov (x2). The former makes sense (12 month data with labels at 3 month intervals, so ggplot uses the last month of each period for the label). The latter is OK but makes less sense to me (why switch to labelling the middle month of each quarter?)
  • The particular problem I am experiencing is not that labels are in the wrong order, nor that labels are missing. It is that there are additional x-axis labels included, which do not relate to any of the months in the supplied data. When supplying data from Jan 2021 - Dec 2021 (12 months) and requesting date labels at 1 month intervals, I get 12 labels, one for each month of the year. This is completely as I expected. When supplying data from Jan 2021 - Dec 2022 (24 months) and requesting date labels at 1 month intervals, I get 26 labels on the x axis including Dec 2020 and Jan 2023. This is not the behaviour I would expect. The risk is that the reader of the report could be led to believe, from looking at the x-axis, that the data extends into Dec 2020 and into Jan 2023, which is not the case. There are no data points for these months, so I would not expect the labels to be included on the x axis.
  • Furthermore, using the limits argument of scale_x_datetime() to try to correct this behaviour results in the desired x axis labels, but with 2 months of data missing from both the start and the end of the dataset.
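A possible workaround for the points above is to leave the limits alone and pass explicit break positions through the breaks argument of scale_x_datetime(), so labels can only appear where they are asked for. This is a sketch under assumptions, not a confirmed fix: the data frame d and its total column are placeholder values rather than the reprex data, the time zone is pinned to UTC, and the plot is only built if ggplot2 is installed.

```r
# Placeholder data: 24 month-start timestamps with dummy totals.
mon <- seq(as.POSIXct("2021-01-01", tz = "UTC"), by = "month", length.out = 24)
d <- data.frame(mon = mon, total = seq_along(mon))

# Quarter-end breaks (Mar, Jun, Sep, Dec) that never leave the data range.
brks <- seq(as.POSIXct("2021-03-01", tz = "UTC"), max(d$mon), by = "3 months")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(d, ggplot2::aes(x = mon, y = total)) +
    ggplot2::geom_line() +
    # Explicit breaks replace date_breaks; limits are left untouched,
    # so no rows are pushed out of bounds.
    ggplot2::scale_x_datetime(breaks = brks, date_labels = "%b %y") +
    ggplot2::labs(x = NULL, y = NULL)
  # print(p) to draw the chart
}
```

Because the limits are not narrowed, this should avoid the "Removed rows" warning seen with the limits approach. For monthly labels, breaks = d$mon would be the equivalent choice.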
