Combining tables from multiple csv files (tables can start in any line of the files)

DanielCoutinho · May 31, 2019, 8:58pm

Hi everyone,

I am looking for a code where I can combine multiple csv files that reside in a folder into a single data frame. It should combine all files within that folder.

The main issue here is that the actual table within these files can start at any line. They can be in line 10 or in line 100 of those files. It is basically a system generated report that creates multiple files, but it always has a summary in the top which is not part of the table. Since the summary can be small or big, the table ends up starting anywhere within those files.

They do have their headers in common though (Total Count, Event Time, Associated Client Count, Authenticated Client Count). And they are always preceded of the words "AP Total Client Count" on the line above before the table starts.

Would really appreciate some help to fix this.

technocrat · May 31, 2019, 9:26pm

If you are on MAC or Linux, it's simple to write a command line script to strip everything down to the header line and concatenate it to a single file. It might even be possible on Window's powershell, but I haven't used it.

The R alternate is to use readLines{base} to bring in the file and then use stringr to filter out all lines that don't begin with a numeric. Then you would write it back as a csv, import it back with readr and use colnames to add back in the headers. It will be more complicated if any of the preheader summaries begin with a numeric.

Once that's working for a single file, iterating over csv files in a single directory is fairly straightforward and we can continue the discussion.

If you get stuck with parsing the summaries, you should post a reproducible example, called a reprex so that more specific help can be offered.

Matthias · June 1, 2019, 11:25am

Actually it's not so complicated.
Let's say this is one of your files:

Bla bla bla

jspjnf
sdnpnm
sdp

More bla bla bla

Table starts here
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

Then you read it first, figure out the position, where the table actually starts, and read it again, skipping all the rows that do not contain what you want:

library(readr)
# Read in file
rawfile = read_lines("Table_Text.csv") 

# find the begin of the table
begin_table = match("Table starts here", rawfile)

# Read again, but skip first rows
table = read_delim("Table_Text.csv", skip = begin_table,
                 delim = "\t")

You may need to change the text to look for and maybe the separator in the read_delim, you could also use read_csv. (I just generated a tab-separated file)

Matthias

DanielCoutinho · June 4, 2019, 8:52pm

Hi Matthias, your solution works flawlessly for a part of the problem.

It enables me to capture the table within a single file.

The other part of the issue is not addressed however. There are multiple files in the folder (could be 10, 20, 30, changes every month) and the idea is to look at all the files, extract and union the tables as the information on each one is different (pertains to different days of the week within the a desired timeframe).

How could we make that work?

Matthias · June 5, 2019, 8:33am

You said the main issue is the import part.

Here a potential solution:
1.) Identify the files, and write the list into a vector, check out list.files() for that. files = list.files(path, options)
2.) Create a for-loop, going through the file list
for (file2load in files) {..}
and perform:
2A.) The input of the file. -> see above
imported_file = read_csv(file2load)...
2B.) The combination of the files. This depents on your files, e.g. if all files have the same col.names you can use bind_rows (from dplyr).
merged_file = bind_rows(merged_file, imported_file)

However in this case you should have an empty file first, to actually attach the files to.
This you can do outside(!) of the for-loop with a single file that you imported, just extracting the column names:
merged_file = imported_file1[FALSE,]

Try the combination first with 2 or 3 files you address directly:
file1 =
file2 =
file3 =
files = c(file1, file2, file3)
Then work on the loop and the automated generation of the list.

DanielCoutinho · June 5, 2019, 2:34pm

Hi Matthias,

Thanks once more for the help. It seems I lack the knowledge in R to make this work.

Tried it out on my own, but couldn't get the code to work.

Thanks for the patience and for attempting to help though.

system · June 26, 2019, 2:34pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.