Manipulating datasets and then merging them.

HTP · May 3, 2022, 11:01pm

Hello everyone!

I am trying to compile 12 monthly datasets (January to December) into one big yearly dataset (i.e., stack each monthly dataset on top of each other). The problem is that each dataset contains some "administrative information" that I do not need, and they mess up the format when I import the data.

For example, the January dataset looks similar to this:


January Data
Created at 2020-02-01 06:00:12
For January 2020
Date	Variable 1	Variable 2	Variable 3
2020-01-01	XX	YY	ZZ
2020-01-02	XX	YY	ZZ
2020-01-03	XX	YY	ZZ
.	.	.	.
.	.	.	.
.	.	.	.
2020-01-30	XX	YY	ZZ
2020-01-31	XX	YY	ZZ

where XX, YY, and ZZ represent some numerical values.

As you can see, the first three rows show some "administrative information" about the dataset (i.e., data name, when it was created, and which 2020 month it represents) that I do not care about. Also, this messes up the format when I import the data and creates something that looks similar to this:

JanuaryData = data.frame(`January Data`= c('Created at 2020-02-01 06:00:12',  'For January 2020', 'Date', '2020-01-01', '2020-01-02', '2020-01-03', '.', '.', '.', '2020-01-30', '2020-01-31'),
                         X = c(' ', ' ', 'Variable 1', 'XX', 'XX', 'XX','.', '.', '.','XX','XX'),
                         X1 = c(' ', ' ', 'Variable 2', 'YY', 'YY', 'YY','.', '.', '.','YY','YY'),
                         X2 = c(' ', ' ', 'Variable 3', 'ZZ', 'ZZ', 'ZZ','.', '.', '.','ZZ','ZZ'))

View(JanuaryData)

or

January.Data	X	X1	X2
Created at 2020-02-01 06:00:12
For January 2020
Date	Variable 1	Variable 2	Variable 3
2020-01-01	XX	YY	ZZ
2020-01-02	XX	YY	ZZ
2020-01-03	XX	YY	ZZ
.	.	.	.
.	.	.	.
.	.	.	.
2020-01-30	XX	YY	ZZ
2020-01-31	XX	YY	ZZ

Again, I have to merge all 12 monthly datasets, so if I try to merge this with other monthly files (For this example, let's just merge January and February datasets) , I'll get something like this:

CombinedData = data.frame(`January Data`= c('Created at 2020-02-01 06:00:12',  'For January 2020', 'Date', '2020-01-01', '2020-01-02', '2020-01-03', '.', '.', '.', '2020-01-30', '2020-01-31', 'February Data', 'Created at 2020-03-01 06:00:21', 'For February 2020', 'Date', '2020-02-01', '2020-02-02', '2020-02-03', '.', '.', '.', '2020-02-28', '2020-02-29'),
                         X = c(' ', ' ', 'Variable 1', 'XX', 'XX', 'XX','.', '.', '.','XX','XX', ' ', ' ', ' ', 'Variable 1', 'XX', 'XX', 'XX','.', '.', '.','XX','XX'),
                         X1 = c(' ', ' ', 'Variable 2', 'YY', 'YY', 'YY','.', '.', '.','YY','YY', ' ', ' ', ' ', 'Variable 2', 'YY', 'YY', 'YY','.', '.', '.','YY','YY'),
                         X2 = c(' ', ' ', 'Variable 3', 'ZZ', 'ZZ', 'ZZ','.', '.', '.','ZZ','ZZ', ' ', ' ', ' ', 'Variable 3', 'ZZ', 'ZZ', 'ZZ','.', '.', '.','ZZ','ZZ'))

View(CombinedData)

or

January.Data	X	X1	X2
Created at 2020-02-01 06:00:12
For January 2020
Date	Variable 1	Variable 2	Variable 3
2020-01-01	XX	YY	ZZ
2020-01-02	XX	YY	ZZ
2020-01-03	XX	YY	ZZ
.	.	.	.
.	.	.	.
.	.	.	.
2020-01-30	XX	YY	ZZ
2020-01-31	XX	YY	ZZ
February Data
Created at 2020-03-01 06:00:21
For February 2020
Date	Variable 1	Variable 2	Variable 3
2020-02-01	XX	YY	ZZ
2020-02-02	XX	YY	ZZ
2020-02-03	XX	YY	ZZ
.	.	.	.
.	.	.	.
.	.	.	.
2020-02-28	XX	YY	ZZ
2020-02-29	XX	YY	ZZ

Again, the first three rows of each dataset is causing a lot of trouble, so I wanna manipulate the data to get rid of these "administrative information" in each dataset, and make it look like this when I merge them:

CombinedData2 = data.frame(Date= c( '2020-01-01', '2020-01-02', '2020-01-03', '.', '.', '.', '2020-01-30', '2020-01-31', '2020-02-01', '2020-02-02', '2020-02-03', '.', '.', '.', '2020-02-28', '2020-02-29'),
                          'Variable 1' = c('XX', 'XX', 'XX','.', '.', '.','XX','XX', 'XX', 'XX', 'XX','.', '.', '.','XX','XX'),
                          'Variable 2' = c('YY', 'YY', 'YY','.', '.', '.','YY','YY', 'YY', 'YY', 'YY','.', '.', '.','YY','YY'),
                          'Variable 3' = c('ZZ', 'ZZ', 'ZZ','.', '.', '.','ZZ','ZZ', 'ZZ', 'ZZ', 'ZZ','.', '.', '.','ZZ','ZZ'))

View(CombinedData2)

or

Date	Variable.1	Variable.2	Variable.3
2020-01-01	XX	YY	ZZ
2020-01-02	XX	YY	ZZ
2020-01-03	XX	YY	ZZ
.	.	.	.
.	.	.	.
.	.	.	.
2020-01-30	XX	YY	ZZ
2020-01-31	XX	YY	ZZ
2020-02-01	XX	YY	ZZ
2020-02-02	XX	YY	ZZ
2020-02-03	XX	YY	ZZ
.	.	.	.
.	.	.	.
.	.	.	.
2020-02-28	XX	YY	ZZ
2020-02-29	XX	YY	ZZ

That is, I want to get rid of the first three rows in each monthly dataset and make the 4th row (the row that has "Date", "Variable 1", "Variable 2", and "Variable 3" to represent the the column names. I feel like I'm not doing a great job clearly explaining this, but please let me know if you know how to solve this issue. Please note that the actual data that I'm using has over 20 variables and a bit more complex than the example shown above. I am also quite new to R, so it would be great if you could explain the steps clearly!

Thank you in advance!

andresrcs · May 3, 2022, 11:13pm

Whereas it is perfectly possible to clean up your dataframes, it would be better to properly import them without metadata from the beginning. Can you tell us from where and how are you reading this dataset into memory?

system · May 24, 2022, 11:13pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.