Convert character string into table

jonspring · May 30, 2018, 10:42pm

I used tesseract::ocr to extract text from a vector of PNG files. The result is a character vector with one long concatenated string per PNG file, like this:

start <- 
c("PreVIous Day CompOSIte Report\nStandard Previous Day Composite Report\nAs of 04l16l2018",
 "PreVIous Day CompOSIte Report\nStandard Previous Day Composite Report\nAs of 04l17l2018")

I want to convert this into a table with a row for each line as denoted by "\n", something like this:

target <-
  tibble::tribble(
  ~page, ~line, ~text,
      1,     1, "PreVIous Day CompOSIte Report",
      1,     2, "Standard Previous Day Composite Report",
      1,     3, "As of 04l16l2018",
      2,     1, "PreVIous Day CompOSIte Report",
      2,     2, "Standard Previous Day Composite Report",
      2,     3, "As of 04l17l2018")

I bet this is something simple with readr or tidytext, but it's eluding me!

karawoo · May 31, 2018, 4:19am

Here's one approach -- we map() over the vector of strings, splitting each one on the newline and creating a data frame for each set, then join the list of data frames together with bind_rows().

library("tidyverse")
library("stringr")

start <-
  c("PreVIous Day CompOSIte Report\nStandard Previous Day Composite Report\nAs of 04l16l2018",
    "PreVIous Day CompOSIte Report\nStandard Previous Day Composite Report\nAs of 04l17l2018")

dat <- map(start, function(x) {
  tibble(text = unlist(str_split(x, pattern = "\\n"))) %>%
    rowid_to_column(var = "line")
})

bind_rows(dat, .id = "page") %>%
  select(page, line, text) # Reorder the columns
#> # A tibble: 6 x 3
#>   page   line text                                  
#>   <chr> <int> <chr>                                 
#> 1 1         1 PreVIous Day CompOSIte Report         
#> 2 1         2 Standard Previous Day Composite Report
#> 3 1         3 As of 04l16l2018                      
#> 4 2         1 PreVIous Day CompOSIte Report         
#> 5 2         2 Standard Previous Day Composite Report
#> 6 2         3 As of 04l17l2018

Created on 2018-05-30 by the reprex package (v0.2.0).
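
If a list of data frames is not needed along the way, the same split could probably be done with tidyr::separate_rows(). This is only a sketch of an alternative, not something tested beyond the sample data: it assumes the pages are in the order they appear in start, and the name result is just illustrative.

```r
library(dplyr)
library(tidyr)

start <-
  c("PreVIous Day CompOSIte Report\nStandard Previous Day Composite Report\nAs of 04l16l2018",
    "PreVIous Day CompOSIte Report\nStandard Previous Day Composite Report\nAs of 04l17l2018")

# Assumption: page numbers simply follow the order of the input vector
result <- tibble(page = seq_along(start), text = start) %>%
  separate_rows(text, sep = "\n") %>%   # one row per newline-separated line
  group_by(page) %>%
  mutate(line = row_number()) %>%       # number the lines within each page
  ungroup() %>%
  select(page, line, text)

result
```

Here page is an integer rather than the character column that bind_rows(.id = "page") produces.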

cderv · May 31, 2018, 6:12am

Here is another approach with stringr::str_split. Because str_split is vectorised over its string argument, you can apply it directly to your start vector. With the tidyverse, it could look like this:

start <- 
  c("PreVIous Day CompOSIte Report\nStandard Previous Day Composite Report\nAs of 04l16l2018",
    "PreVIous Day CompOSIte Report\nStandard Previous Day Composite Report\nAs of 04l17l2018")
library(tidyverse)
#> Warning: package 'stringr' was built under R version 3.4.4
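tibble(text = start) %>%
  mutate(page = seq_along(text),                       # one page number per input string
         text = str_split(text, pattern = "\\n")) %>%  # list-column: one character vector per page
  unnest(text) %>%                                     # one row per line
  group_by(page) %>%
  mutate(line = row_number()) %>%                      # number the lines within each page
  ungroup() %>%
  select(page, line, text)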
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#> # A tibble: 6 x 3
#>    page  line text                                  
#>   <int> <int> <chr>                                 
#> 1     1     1 PreVIous Day CompOSIte Report         
#> 2     1     2 Standard Previous Day Composite Report
#> 3     1     3 As of 04l16l2018                      
#> 4     2     1 PreVIous Day CompOSIte Report         
#> 5     2     2 Standard Previous Day Composite Report
#> 6     2     3 As of 04l17l2018

Created on 2018-05-31 by the reprex package (v0.2.0).
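
For comparison, the same idea can be sketched in base R without any packages. This is a minimal sketch assuming every line break is a plain "\n"; the names lines_per_page, n_lines, and result are only illustrative.

```r
start <-
  c("PreVIous Day CompOSIte Report\nStandard Previous Day Composite Report\nAs of 04l16l2018",
    "PreVIous Day CompOSIte Report\nStandard Previous Day Composite Report\nAs of 04l17l2018")

# strsplit() is vectorised too: it returns a list with one character vector per page
lines_per_page <- strsplit(start, "\n", fixed = TRUE)
n_lines <- lengths(lines_per_page)

result <- data.frame(
  page = rep(seq_along(lines_per_page), times = n_lines),  # repeat each page index once per line
  line = sequence(n_lines),                                # restart the line counter on each page
  text = unlist(lines_per_page),
  stringsAsFactors = FALSE
)

result
```

This gives a plain data.frame rather than a tibble, with page and line as integers.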