Efficient way to read text file with repeated blocks/chunks?

Hi everyone,

I have this sample text file with the following format:

  • 1st line: the number of PAR blocks
  • next 2 lines (per block): name and some corresponding parameters
  • next 5 lines (per block): an array of 5 x 10 elements

I was wondering what would be the best way to read this kind of text file? My actual file has about 100 million lines. My current approach is to read the whole file in with scan() and then cycle through each block step by step, but it's slow. (A simplified sketch of what I'm doing follows the sample below.)

Any suggestion appreciated!

3      
PAR01
CONST 3 F4.0 6
999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 
999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 
999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 
999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 
999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0
PAR02
CONST 3 F4.0 6
000.0 000.0 000.0 000.0 000.0 000.0 000.0 000.0 000.0 000.0 
000.0 000.0 000.0 000.0 000.0 000.0 000.0 000.0 000.0 000.0 
123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0 
123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0 
123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0
PAR03
CONST 3 F4.0 6
123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0 123.0 
999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 
111.0 111.0 111.0 111.0 111.0 111.0 111.0 111.0 111.0 111.0 
111.0 111.0 111.0 111.0 111.0 111.0 111.0 111.0 111.0 111.0 
999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0 999.0
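
Roughly, my current approach looks like this (a simplified sketch, not my exact code):

all_lines <- scan("sample.txt", what = character(), sep = "\n", quiet = TRUE)
block_count <- as.integer(all_lines[1])
blocks <- vector("list", block_count)
for (ii in seq_len(block_count)) {
  start <- 2 + (ii - 1) * 7   # each block is 7 lines: name, params, 5 array rows
  blocks[[ii]] <- all_lines[start:(start + 6)]
}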


This is a so-called stateful parsing problem, and I have yet to find a good way of doing that in R, so I default to habits of old and use a small Perl script with line-by-line regexes (but don't tell anyone :smirk:)


There are four important threshold questions.

  1. What format is your source file?
  2. What is your target data structure?
  3. Does your platform have sufficient RAM for both the data and libraries you'll be using?
  4. What's the unit of analysis? Can you process each PAR block independently to boil it down to a smaller object? That promotes lazy evaluation: fetch one, read one, process one, store one, repeat. Alternatively, are there inter-block interactions that require everything to be in memory all at once?

I agree with @Leon that parsing beyond flat files is awkward in R, but I disagree that pre-processing is shameful. Generically, this feels like a problem that requires lex/yacc (sorry, flex/bison) compiled to C++ for use with Rcpp.
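
Not a full grammar, but a hedged sketch of the Rcpp end of that idea. Everything here (the function name, tokenizing with istringstream) is illustrative rather than a worked-out parser:

library(Rcpp)

cppFunction(includes = "#include <sstream>", code = '
NumericVector parse_values_cpp(CharacterVector lines) {
  // Tokenize the 5 array lines into one vector of 50 doubles
  std::vector<double> out;
  for (int i = 0; i < lines.size(); ++i) {
    std::istringstream ss(Rcpp::as<std::string>(lines[i]));
    double x;
    while (ss >> x) out.push_back(x);
  }
  return Rcpp::wrap(out);
}
')

Feeding it the 5 array lines of one block and wrapping the result in matrix(..., nrow = 5, byrow = TRUE) rebuilds the 5 x 10 array on the R side.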


R can process input as a stream, it's just not the default. If readLines() or read.table() or similar are given a file path as a character value, they'll create the connection, read from it, and close it when done. To keep it open, you have to create the connection object yourself.
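
For instance, with the sample text above saved in a file named sample.txt:

readLines("sample.txt", n = 1)  # reopens the file each call, so this is always the first line

conn <- file("sample.txt", open = "r")
readLines(conn, n = 1)  # first line
readLines(conn, n = 1)  # second line: the open connection remembers its position
close(conn)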

Sticking with sample.txt, first define a function that takes the 7 lines of each block and returns a nice object. I'm not sure exactly what you want, so I'm thinking a list with 3 items: name, parameters as a named numeric vector, and a matrix.

parse_block <- function(block_lines) {
  # Block name is just first line
  name <- block_lines[1]

  # Create the named parameters vector
  # Includes defensive programming in case the line is blank
  param_parts <- strsplit(block_lines[2], "\\s+")[[1]]
  if (length(param_parts)) {
    name_indices  <- seq(1, length(param_parts), by = 2)
    value_indices <- seq(2, length(param_parts), by = 2)
    parameters <- as.numeric(param_parts[value_indices])
    names(parameters) <- param_parts[name_indices]
  } else {
    parameters <- numeric(0)
  }

  # Split the array lines by spacing, then use them for a matrix
  mat_strings <- unlist(strsplit(block_lines[3:7], "\\s+"))
  mat <- matrix(
    as.numeric(mat_strings),
    nrow  = 5,
    byrow = TRUE
  )

  # Compose the output
  list(
    name       = name,
    parameters = parameters,
    values     = mat
  )
}
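
A quick sanity check on the first block (lines 2-8 of the sample):

sample_lines <- readLines("sample.txt")
parse_block(sample_lines[2:8])$name
# [1] "PAR01"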

Next, connect to the file. Read the top line, which helpfully gives the length of the result. We'll store the outputs in a list.

conn <- file("sample.txt", open = "r")
block_count <- scan(conn, nlines = 1)
block_count
# [1] 3

output <- vector("list", block_count)

Now we can read the connection in a loop, each time reading 7 lines from where the previous iteration left off. That works because conn stays open between calls.

for (ii in seq_len(block_count)) {
  next_chunk <- readLines(conn, n = 7)
  output[[ii]] <- parse_block(next_chunk)
}

close(conn)

output[[3]]
# $name
# [1] "PAR03"
# 
# $parameters
# CONST  F4.0 
#     3     6 
# 
# $values
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]  123  123  123  123  123  123  123  123  123   123
# [2,]  999  999  999  999  999  999  999  999  999   999
# [3,]  111  111  111  111  111  111  111  111  111   111
# [4,]  111  111  111  111  111  111  111  111  111   111
# [5,]  999  999  999  999  999  999  999  999  999   999

Thank you @technocrat! My answers to your questions are below:

  1. What format is your source file? It's a text file
  2. What is your target data structure? A list or 3D array is fine. I only need to make changes to those blocks and write them to a new file with the exact same structure
  3. Does your platform have sufficient RAM for both the data and libraries you'll be using? My PC has 16GB RAM but I can use our cluster which has 256GB RAM too
  4. What's the unit of analysis? The whole file is an input file for another program. All blocks need to be modified before that program can start running. That being said, I can read each block, modify it, and append it to a new file

Thanks @nwerth! This is exactly what I am doing at the moment. The only difference is that I store the result in a 3D array. I guess this is the only way to go for this type of data.


Then I think @nwerth has the right path: read-a-chunk, process-a-chunk, write/append-a-chunk.

That's much better than my crudely nested list. Just make sure to create the array first and fill it in with the loop. That avoids the painfully slow peril of "growing" objects.

# parse_block_to_array() would be a variant of parse_block() that returns
# only the 5 x 10 matrix (conn is the open connection from before)
output <- array(NA_real_, dim = c(block_count, 5, 10))
for (ii in seq_len(block_count)) {
  next_chunk <- readLines(conn, n = 7)
  output[ii, , ] <- parse_block_to_array(next_chunk)
}
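
For the write/append half, since the goal is a new file with the exact same structure, something along these lines could work. write_block() and the formatC() settings are guesses based on the sample, so adjust them to the real format spec:

# Sketch: write one block back out in the original layout
write_block <- function(conn, name, param_line, mat) {
  writeLines(c(name, param_line), conn)
  # flag = "0" reproduces the zero-padded look ("000.0")
  chars <- formatC(mat, format = "f", digits = 1, width = 5, flag = "0")
  writeLines(apply(chars, 1, paste, collapse = " "), conn)
}

out_conn <- file("modified.txt", open = "w")
writeLines(as.character(block_count), out_conn)
for (ii in seq_len(block_count)) {
  mat <- output[ii, , ]
  # ... modify mat here ...
  # In real code, carry the original name and parameter lines through
  # instead of reconstructing them as done here for illustration
  name <- paste0("PAR", formatC(ii, width = 2, flag = "0"))
  write_block(out_conn, name, "CONST 3 F4.0 6", mat)
}
close(out_conn)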

TYSM @nwerth !!!!!!!!

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems.
