Hello everyone!
I am trying to compile 12 monthly datasets (January to December) into one big yearly dataset (i.e., stack each monthly dataset on top of each other). The problem is that each dataset contains some "administrative information" that I do not need, and they mess up the format when I import the data.
For example, the January dataset looks similar to this:
January Data | |||
Created at 2020-02-01 06:00:12 | |||
For January 2020 | |||
Date | Variable 1 | Variable 2 | Variable 3 |
2020-01-01 | XX | YY | ZZ |
2020-01-02 | XX | YY | ZZ |
2020-01-03 | XX | YY | ZZ |
. | . | . | . |
. | . | . | . |
. | . | . | . |
2020-01-30 | XX | YY | ZZ |
2020-01-31 | XX | YY | ZZ |
where XX, YY, and ZZ represent some numerical values.
As you can see, the first three rows show some "administrative information" about the dataset (i.e., data name, when it was created, and which 2020 month it represents) that I do not care about. Also, this messes up the format when I import the data and creates something that looks similar to this:
JanuaryData = data.frame(`January Data`= c('Created at 2020-02-01 06:00:12', 'For January 2020', 'Date', '2020-01-01', '2020-01-02', '2020-01-03', '.', '.', '.', '2020-01-30', '2020-01-31'),
X = c(' ', ' ', 'Variable 1', 'XX', 'XX', 'XX','.', '.', '.','XX','XX'),
X1 = c(' ', ' ', 'Variable 2', 'YY', 'YY', 'YY','.', '.', '.','YY','YY'),
X2 = c(' ', ' ', 'Variable 3', 'ZZ', 'ZZ', 'ZZ','.', '.', '.','ZZ','ZZ'))
View(JanuaryData)
or
January.Data | X | X1 | X2 |
---|---|---|---|
Created at 2020-02-01 06:00:12 | |||
For January 2020 | |||
Date | Variable 1 | Variable 2 | Variable 3 |
2020-01-01 | XX | YY | ZZ |
2020-01-02 | XX | YY | ZZ |
2020-01-03 | XX | YY | ZZ |
. | . | . | . |
. | . | . | . |
. | . | . | . |
2020-01-30 | XX | YY | ZZ |
2020-01-31 | XX | YY | ZZ |
Again, I have to merge all 12 monthly datasets, so if I try to merge this with other monthly files (For this example, let's just merge January and February datasets) , I'll get something like this:
CombinedData = data.frame(`January Data`= c('Created at 2020-02-01 06:00:12', 'For January 2020', 'Date', '2020-01-01', '2020-01-02', '2020-01-03', '.', '.', '.', '2020-01-30', '2020-01-31', 'February Data', 'Created at 2020-03-01 06:00:21', 'For February 2020', 'Date', '2020-02-01', '2020-02-02', '2020-02-03', '.', '.', '.', '2020-02-28', '2020-02-29'),
X = c(' ', ' ', 'Variable 1', 'XX', 'XX', 'XX','.', '.', '.','XX','XX', ' ', ' ', ' ', 'Variable 1', 'XX', 'XX', 'XX','.', '.', '.','XX','XX'),
X1 = c(' ', ' ', 'Variable 2', 'YY', 'YY', 'YY','.', '.', '.','YY','YY', ' ', ' ', ' ', 'Variable 2', 'YY', 'YY', 'YY','.', '.', '.','YY','YY'),
X2 = c(' ', ' ', 'Variable 3', 'ZZ', 'ZZ', 'ZZ','.', '.', '.','ZZ','ZZ', ' ', ' ', ' ', 'Variable 3', 'ZZ', 'ZZ', 'ZZ','.', '.', '.','ZZ','ZZ'))
View(CombinedData)
or
January.Data | X | X1 | X2 |
---|---|---|---|
Created at 2020-02-01 06:00:12 | |||
For January 2020 | |||
Date | Variable 1 | Variable 2 | Variable 3 |
2020-01-01 | XX | YY | ZZ |
2020-01-02 | XX | YY | ZZ |
2020-01-03 | XX | YY | ZZ |
. | . | . | . |
. | . | . | . |
. | . | . | . |
2020-01-30 | XX | YY | ZZ |
2020-01-31 | XX | YY | ZZ |
February Data | |||
Created at 2020-03-01 06:00:21 | |||
For February 2020 | |||
Date | Variable 1 | Variable 2 | Variable 3 |
2020-02-01 | XX | YY | ZZ |
2020-02-02 | XX | YY | ZZ |
2020-02-03 | XX | YY | ZZ |
. | . | . | . |
. | . | . | . |
. | . | . | . |
2020-02-28 | XX | YY | ZZ |
2020-02-29 | XX | YY | ZZ |
Again, the first three rows of each dataset is causing a lot of trouble, so I wanna manipulate the data to get rid of these "administrative information" in each dataset, and make it look like this when I merge them:
CombinedData2 = data.frame(Date= c( '2020-01-01', '2020-01-02', '2020-01-03', '.', '.', '.', '2020-01-30', '2020-01-31', '2020-02-01', '2020-02-02', '2020-02-03', '.', '.', '.', '2020-02-28', '2020-02-29'),
'Variable 1' = c('XX', 'XX', 'XX','.', '.', '.','XX','XX', 'XX', 'XX', 'XX','.', '.', '.','XX','XX'),
'Variable 2' = c('YY', 'YY', 'YY','.', '.', '.','YY','YY', 'YY', 'YY', 'YY','.', '.', '.','YY','YY'),
'Variable 3' = c('ZZ', 'ZZ', 'ZZ','.', '.', '.','ZZ','ZZ', 'ZZ', 'ZZ', 'ZZ','.', '.', '.','ZZ','ZZ'))
View(CombinedData2)
or
Date | Variable.1 | Variable.2 | Variable.3 |
---|---|---|---|
2020-01-01 | XX | YY | ZZ |
2020-01-02 | XX | YY | ZZ |
2020-01-03 | XX | YY | ZZ |
. | . | . | . |
. | . | . | . |
. | . | . | . |
2020-01-30 | XX | YY | ZZ |
2020-01-31 | XX | YY | ZZ |
2020-02-01 | XX | YY | ZZ |
2020-02-02 | XX | YY | ZZ |
2020-02-03 | XX | YY | ZZ |
. | . | . | . |
. | . | . | . |
. | . | . | . |
2020-02-28 | XX | YY | ZZ |
2020-02-29 | XX | YY | ZZ |
That is, I want to get rid of the first three rows in each monthly dataset and make the 4th row (the row that has "Date", "Variable 1", "Variable 2", and "Variable 3" to represent the the column names. I feel like I'm not doing a great job clearly explaining this, but please let me know if you know how to solve this issue. Please note that the actual data that I'm using has over 20 variables and a bit more complex than the example shown above. I am also quite new to R, so it would be great if you could explain the steps clearly!
Thank you in advance!