My Data Analytics Professional Certificate Project

Obioraa · October 17, 2025, 6:15pm

How do I form a data frame in R from multiple datasets, and present the new dataset through ggplot2 visualizations?

Thanks,

Benjamin

prubin · October 17, 2025, 6:39pm

Are you talking about multiple datasets with the same variables in them? Are you looking to generate a single plot with a line/bar/something else for each source dataset?

Obioraa · October 17, 2025, 7:09pm

Thanks for responding. I am trying to take variables x, y, z from dataset A, variables b, c, d from dataset B, and variables j, k, l from dataset C, form a new dataset D from them, and then plot these variables with ggplot2. The date/time should be the same, and the datasets come from the same file/folder (mturkfitbit_export_3.12.16-4.11.16). Thanks!

prubin · October 17, 2025, 7:31pm

Is there a 1-to-1 correspondence between rows in each dataset (i.e., the first row in A goes with the first rows in B and C), or some columns that tell you which row in A matches which row in B and which row in C?

Obioraa · October 17, 2025, 9:34pm

The only column they have in common is the Id (identification) column, but its values are not uniform or in the same order across the datasets. Thanks!

Obioraa · October 17, 2025, 9:45pm

By the looks of things, I don't think what I am trying to do is possible or correct. Hence, I will try something else. Thanks!

Obioraa · October 17, 2025, 10:09pm

Sir, how do you separate date and time (3/12/2016 0:00) into separate columns, assuming they are under the ActivityHour column? Thanks!
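
For a timestamp like the one quoted, one possible approach is sketched below in base R. It assumes ActivityHour is a character column in month/day/year hour:minute form, as the sample value suggests; the real file may use a different format (for example with seconds or AM/PM), in which case the format string would need adjusting. The data frame `hourly` and its values are made up for illustration.

```r
# Hypothetical sample data; only the ActivityHour format is taken from the post.
hourly <- data.frame(
  Id = c(1, 1, 2),
  ActivityHour = c("3/12/2016 0:00", "3/12/2016 13:00", "4/11/2016 9:30"),
  stringsAsFactors = FALSE
)

# Parse the text into a date-time. A fixed time zone (UTC, an assumption here)
# keeps the result independent of the local machine's settings.
stamp <- as.POSIXct(hourly$ActivityHour, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Split the parsed value into a Date column and a time-of-day text column.
hourly$Date <- as.Date(stamp, tz = "UTC")
hourly$Time <- format(stamp, "%H:%M")

print(hourly)
```

Keeping the parsed date-time as well as the two new columns is usually convenient, since a date-time column can be used directly as a join key or a plot axis.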

prubin · October 17, 2025, 10:12pm

Try the following example (which produces your data frame D but does not do any plotting).

# Create some test data.

times <- c("10/17/25 09:30", "10/03/25 13:10", "10/08/25 04:07", "10/11/25 13:32") |>
           strptime(format = "%m/%d/%y %H:%M")  # time stamps for the data (%y reads two-digit years)
A <- data.frame(Time = times, x = 1:4, y = 12:15, z = -3:0, q = NA, w = c(0, 1, 0, 1))
B <- data.frame(Time = sort(times), b = 8:5, c = -2, d = (1:4) * pi)
C <- data.frame(Time = sort(times, decreasing = TRUE), j = 4:1, k = 12:15, l = c(-1, 1, 2, -1),
                m = 9:12, n = -3:0)

# Merge the dataframes by time stamp, keeping the desired columns.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

D <- inner_join(A, B, by = join_by(Time)) |>
  inner_join(C, by = join_by(Time)) |>
  select(Time, x, y, z, b, c, d, j, k, l) |>
  arrange(Time)

Created on 2025-10-17 with reprex v2.1.1

The code assumes that the ID column has the same name in every dataframe ("Time" here), but that can be worked around in the join_by() calls. It does not assume that dataframes have the same row ordering, nor that all variables are to be used. If an ID value appears in one or more of A, B and C but not all of them, the data with that ID will not be included in D.
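
As a minimal sketch of the two points above, the following shows a join where the ID column is named differently in each data frame, and how IDs missing from one side are dropped. The data frames, the column names `Id` and `id`, and all values are hypothetical; `join_by()` requires dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical data: the key is called "Id" in A but "id" in B,
# and the rows are in different orders.
A <- data.frame(Id = c(3, 1, 2), x = c(30, 10, 20))
B <- data.frame(id = c(2, 4, 3), b = c("two", "four", "three"))

# join_by(Id == id) matches A$Id against B$id; the result keeps A's column name.
D <- inner_join(A, B, by = join_by(Id == id)) |>
  arrange(Id)

# Ids 1 and 4 each appear in only one data frame, so an inner join drops them.
print(D)
```

If rows without a match should be kept instead, `left_join()` or `full_join()` accept the same `by` argument and fill the missing values with NA.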

Obioraa · October 17, 2025, 10:49pm

Appreciated. Many thanks Sir!