I am looking for code to combine multiple csv files that reside in a folder into a single data frame. It should combine all files within that folder.
The main issue here is that the actual table within these files can start at any line. It can start at line 10 or at line 100 of those files. It is basically a system-generated report that creates multiple files, but it always has a summary at the top which is not part of the table. Since the summary can be small or big, the table ends up starting anywhere within those files.
They do have their headers in common though (Total Count, Event Time, Associated Client Count, Authenticated Client Count). And they are always preceded by the words "AP Total Client Count" on the line just above where the table starts.
If you are on macOS or Linux, it's simple to write a command-line script to strip everything down to the header line and concatenate the results into a single file. It might even be possible in Windows PowerShell, but I haven't used it.
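For example, a rough awk sketch of that idea (the demo files and their contents are made up to match the format described in the question; adjust the header pattern to your real files):

```shell
# Demo files standing in for two system reports (hypothetical content):
cd "$(mktemp -d)"
printf 'Summary line one\nSummary line two\nAP Total Client Count\nTotal Count,Event Time,Associated Client Count,Authenticated Client Count\n10,08:00,6,4\n' > day1.csv
printf 'Short summary\nAP Total Client Count\nTotal Count,Event Time,Associated Client Count,Authenticated Client Count\n12,09:00,7,5\n' > day2.csv

# Strip each file down to its table (header line onwards) and concatenate;
# the header row is kept only once. Assumes the header line starts with
# "Total Count" and that no summary line does.
awk '
  FNR == 1 { in_table = 0 }        # reset at the start of each file
  /^"?Total Count/ {               # header row found: start printing
    in_table = 1
    if (header_done++) next        # skip repeated headers
  }
  in_table
' *.csv > combined.txt
```

On a real folder you would cd into it first; if your reports quote the header fields or use a different delimiter, the pattern needs adjusting.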
The R alternative is to use readLines{base} to bring in the file and then use stringr to filter out all lines that don't begin with a numeric. Then you would write it back out as a csv, import it back with readr, and use colnames to add the headers back in. It will be more complicated if any of the pre-header summary lines begin with a numeric.
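A minimal sketch of that approach (the file name and its contents are hypothetical, just mimicking the report format from the question):

```r
library(stringr)
library(readr)

# Tiny demo file standing in for one system report (hypothetical content):
writeLines(c("Some summary text",
             "Report generated last month",
             "AP Total Client Count",
             "Total Count,Event Time,Associated Client Count,Authenticated Client Count",
             "10,08:00,6,4",
             "12,09:00,7,5"),
           "report.csv")

raw <- readLines("report.csv")

# Keep only the lines that begin with a digit, i.e. the data rows.
# (This is where it breaks if a summary line also starts with a numeric.)
data_lines <- raw[str_detect(raw, "^[0-9]")]
writeLines(data_lines, "report_clean.csv")

# Re-import and add the headers back in
clean <- read_csv("report_clean.csv", col_names = FALSE)
colnames(clean) <- c("Total Count", "Event Time",
                     "Associated Client Count", "Authenticated Client Count")
```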
Once that's working for a single file, iterating over csv files in a single directory is fairly straightforward and we can continue the discussion.
You read the file first to figure out the position where the table actually starts, then read it again, skipping all the rows that do not contain what you want:
library(readr)

# Read in the file line by line
rawfile <- read_lines("Table_Text.csv")

# Find the line that marks the beginning of the table
begin_table <- match("Table starts here", rawfile)

# Read again, but skip everything up to and including that line
table <- read_delim("Table_Text.csv",
                    skip = begin_table,
                    delim = "\t")
You may need to change the text to look for, and maybe the separator in read_delim; you could also use read_csv. (I just generated a tab-separated file for testing.)
Hi Matthias, your solution works flawlessly for a part of the problem.
It enables me to capture the table within a single file.
The other part of the issue is not addressed, however. There are multiple files in the folder (could be 10, 20, 30; it changes every month), and the idea is to look at all the files, then extract and union the tables, as the information in each one is different (it pertains to different days of the week within a desired timeframe).
Here's a potential solution:
1.) Identify the files and write the list into a vector; check out list.files() for that. files = list.files(path, options)
2.) Create a for-loop, going through the file list
for (file2load in files) {..}
and perform:
2A.) The import of the file -> see above
imported_file = read_csv(file2load)...
2B.) The combination of the files. This depends on your files; e.g., if all files have the same column names you can use bind_rows (from dplyr).
merged_file = bind_rows(merged_file, imported_file)
However, in this case you should have an empty data frame first, to actually attach the files to.
This you can do outside(!) of the for-loop, from a single file that you already imported, by keeping only the column names:
merged_file = imported_file1[FALSE, ]
Try the combination first with 2 or 3 files that you address directly:
file1 =
file2 =
file3 =
files = c(file1, file2, file3)
Then work on the loop and the automated generation of the list.
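Putting the steps above together, the whole loop might be sketched like this (the folder path, demo files, and the marker text are assumptions based on the report format described earlier; match() needs the marker line to be an exact match, so use str_which/grep instead if your marker line carries extra characters):

```r
library(readr)
library(dplyr)

# Hypothetical demo: write two small report files into a temp folder,
# standing in for the system-generated monthly files.
path <- file.path(tempdir(), "reports")
dir.create(path, showWarnings = FALSE)
header <- "Total Count,Event Time,Associated Client Count,Authenticated Client Count"
writeLines(c("Summary line one", "Summary line two",
             "AP Total Client Count", header,
             "10,08:00,6,4", "12,09:00,7,5"),
           file.path(path, "day1.csv"))
writeLines(c("Short summary",
             "AP Total Client Count", header,
             "8,08:00,5,3"),
           file.path(path, "day2.csv"))

# 1.) Identify the files
files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)

# 2.) Loop over them: find the marker line, read the table below it,
#     and bind everything together
merged_file <- NULL
for (file2load in files) {
  rawfile <- read_lines(file2load)
  begin_table <- match("AP Total Client Count", rawfile)
  imported_file <- read_csv(file2load, skip = begin_table)
  merged_file <- bind_rows(merged_file, imported_file)
}
```

Starting with merged_file = NULL works here because bind_rows() ignores NULL inputs; the imported_file1[FALSE, ] trick from above is an equivalent way to seed the loop with an empty data frame.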