General Help with Text Extraction

Dr_Dick_MD · December 4, 2017, 11:10pm

Introduction

Apologies if this is already covered ad nauseam, but I haven't been able to find an example matching my needs. Here's an overview of what I'm trying to accomplish:

I have text files (albeit with non-.txt extensions) generated as exports from proprietary software that I would like to parse into a tibble for management and analysis, then parse back into the native format to upload any changes. These files have a consistent structure, similar to that of JSON/XML/HTML; ideally, they could be harvested/scraped in the same way one would with a website, but I have a feeling that's too ambitious for my current needs.

RegEx has gotten me only so far, and I have a feeling there's a better, more efficient way to do this. Can anyone help identify a method or strategy? Examples below:

CSV/JSON-Like Document:

There are two 'components' in the following sample text that exemplify the entire document:

[ProcedureOfOrigin,Export
(CatalogType,
      [ComponentReference,Add
      (IsActive,TRUE)
      (ComponentProperties,
            [CatalogDocument,Find
            (Name,"Foo")
            (ScopeOfFunction,"1")
            (DocumentType,"0")
            ])
      (DocumentProperties,
            [DocumentSubType,Find
            (Description,"Foo Document for Production")
            (Name,"Foo Document")
            ])
      ])
]
[ProcedureOfOrigin,Export
(CatalogType,
      [ComponentReference,Add
      (IsActive,TRUE)
      (ComponentProperties,
            [CatalogDocument,Find
            (Name,"Bar")
            (ScopeOfFunction,"1")
            (DocumentType,"0")
            ])
      (DocumentProperties,
            [DocumentSubType,Find
            (Description,"Bar Document for Production")
            (Name,"Bar Document")
            ])
      ])
]

When considered as a Template, I'm looking for values after almost every comma:

[Variable1,Value1
(SubSection1,
      [Variable2,Value2
      (Variable3,Value3)
      (SubSection2,
            [Variable4,Value4
            (Variable5,"Value5")
            (Variable6,"Value6")
            (Variable7,"Value7")
            ])
      (SubSection3,
            [Variable8,Value8
            (Variable9,"Value9")
            (Variable10,"Value10")
            ])
      ])
]

Desired Output for the JSON/CSV-Like Document:

ProcedureOfOrigin	ComponentReference	CatalogDocument	Name	ScopeOfFunction	DocumentType
Export	Add	Find	Foo	1	0
Export	Add	Find	Bar	1	0

XML/HTML-Like Document:

The other exported file has a template like the following:

Section1.0:
	SubSection1.1:  Value1;;
	SubSection1.2:  Value2;;
	SubSection1.3:  Value3;;
	SubSection1.4:  Value4;;
	SubSection1.5:  Value5;;
	SubSection1.6:  Value6;;
	SubSection1.7:  Value7;;
	SubSection1.8:  YYYY-MM-DD;;
	SubSection1.9:  Value9;;

Section2.0:
	SubSection2.1: This can be a very large block of text with /* Comments in between */ 
	;;
	SubSection2.2: Same thing for this and the rest of the following sections. 
	;;
	SubSection2.3: /****** Comments can sometimes take this form *****/
	;;
	SubSection2.4: And so-on.
	;;
Section3.0:
	SubSection3.1: Usually a two-word-phrase;;
	SubSection3.2:
	/* These comments can be OBNOXIOUS
	and be multi-
	line
	With any characters in them
	Less important for me to have in general */
	;;
	Section4.0:  Integer
	;;
	Section5.0:  Block of Text
	;;
	Section6.0: If-Then Statements + Conclusions.
	;;
	Section7.0:
	;;
Section8.0: Integer;;
Section9.0: End-of-Document

Desired Output for XML-Like Document:

Each Section would be its own tibble (think Normalized Relational Database).

SubSection1.1	SubSection1.2	SubSection1.3	SubSection1.4
Value1	Value2	Value3	Value4

technocrat · November 29, 2018, 5:54pm

I hope by now you've found a way forward. If so, I'm guessing you probably discovered the need for pre-processing.

> obj1 <- "[ProcedureOfOrigin,Export
+ (CatalogType,
+       [ComponentReference,Add
+       (IsActive,TRUE)
+       (ComponentProperties,
+             [CatalogDocument,Find
+             (Name,"Foo")
Error: unexpected symbol in:
"            [CatalogDocument,Find
            (Name,"Foo"
>             (ScopeOfFunction,"1")
Error: unexpected ',' in "            (ScopeOfFunction,"
>             (DocumentType,"0")
Error: unexpected ',' in "            (DocumentType,"
>             ])
Error: unexpected ']' in "            ]"
>       (DocumentProperties,
Error: unexpected ',' in "      (DocumentProperties,"
>             [DocumentSubType,Find
Error: unexpected '[' in "            ["
>             (Description,"Foo Document for Production")
Error: unexpected ',' in "            (Description,"
>             (Name,"Foo Document")
Error: unexpected ',' in "            (Name,"
>             ])
Error: unexpected ']' in "            ]"
>       ])
Error: unexpected ']' in "      ]"
> ]"
Error: unexpected ']' in "]"

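That is, the embedded double quotes terminate the R string early, so they need to be escaped (or the text read from a file, e.g. with readLines()) before R can work with it.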
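
One way to do that pre-processing outside R is a small tokenizer plus a recursive parser. The following is only a sketch in Python, under assumptions taken from the sample above: every block looks like `[Name,Value` followed by `(Key,Value)` items and a closing `]`, unquoted values contain no spaces, and quoted values contain no embedded double quotes. The function names (`tokenize`, `parse_block`, `flatten`, `parse_records`) and the `Block.Key` naming for repeated keys are made up for illustration.

```python
import re

SAMPLE = '''[ProcedureOfOrigin,Export
(CatalogType,
      [ComponentReference,Add
      (IsActive,TRUE)
      (ComponentProperties,
            [CatalogDocument,Find
            (Name,"Foo")
            (ScopeOfFunction,"1")
            (DocumentType,"0")
            ])
      (DocumentProperties,
            [DocumentSubType,Find
            (Description,"Foo Document for Production")
            (Name,"Foo Document")
            ])
      ])
]'''

# One token: a bracket/paren/comma, a double-quoted string, or a bare word.
TOKEN = re.compile(r'\s*(?:([\[\](),])|"([^"]*)"|([^\[\](),"\s]+))')

def tokenize(text):
    """Split the export text into (kind, value) tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        punct, quoted, bare = m.groups()
        if punct:
            tokens.append((punct, punct))
        else:
            tokens.append(("value", quoted if quoted is not None else bare))
        pos = m.end()
    return tokens

def expect(tokens, i, kind):
    """Fail loudly if token i is missing or of the wrong kind."""
    if i >= len(tokens) or tokens[i][0] != kind:
        raise ValueError(f"expected {kind!r} at token {i}")

def parse_block(tokens, i):
    """Parse '[' name ',' value item* ']' at tokens[i]; return (block, next_i)."""
    for offset, kind in enumerate(["[", "value", ",", "value"]):
        expect(tokens, i + offset, kind)
    name, action = tokens[i + 1][1], tokens[i + 3][1]
    i += 4
    fields = []
    while i < len(tokens) and tokens[i][0] == "(":
        expect(tokens, i + 1, "value")
        expect(tokens, i + 2, ",")
        key = tokens[i + 1][1]
        i += 3
        if i < len(tokens) and tokens[i][0] == "[":
            value, i = parse_block(tokens, i)   # nested block
        else:
            expect(tokens, i, "value")
            value, i = tokens[i][1], i + 1      # plain value
        expect(tokens, i, ")")
        i += 1
        fields.append((key, value))
    expect(tokens, i, "]")
    return {"name": name, "action": action, "fields": fields}, i + 1

def flatten(block, row):
    """Copy a nested block into one flat dict (one row per top-level block)."""
    owner = block["name"]
    row[owner] = block["action"]
    for key, value in block["fields"]:
        if isinstance(value, dict):
            flatten(value, row)   # container such as CatalogType: descend
        else:
            # Prefix repeated keys (e.g. the second Name) with their block name.
            row[key if key not in row else f"{owner}.{key}"] = value
    return row

def parse_records(text):
    """Return one flat dict per top-level [ ... ] block."""
    tokens, i, rows = tokenize(text), 0, []
    while i < len(tokens):
        block, i = parse_block(tokens, i)
        rows.append(flatten(block, {}))
    return rows

rows = parse_records(SAMPLE + "\n" + SAMPLE.replace("Foo", "Bar"))
wanted = ["ProcedureOfOrigin", "ComponentReference", "CatalogDocument",
          "Name", "ScopeOfFunction", "DocumentType"]
for row in rows:
    print([row[c] for c in wanted])
# ['Export', 'Add', 'Find', 'Foo', '1', '0']
# ['Export', 'Add', 'Find', 'Bar', '1', '0']
```

Because the parser keeps the nesting until `flatten` is called, the same intermediate structure could also be walked in reverse to write the native format back out.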

The other document type is simpler, but it could probably also benefit from pre-processing. Python would be adequate for this; bison/flex would probably be better, especially if you have a large volume.
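
For the second format, a rough Python sketch might look like the following. It assumes what the template suggests: entries end with `;;`, comments are `/* ... */` and can be dropped, a section header is an unindented `Name:` alone on its line, and indented entries belong to the most recent header. The names `parse_sections` and `TOP` are hypothetical, and collapsing whitespace inside values is a simplification that would need revisiting if the text blocks must round-trip exactly.

```python
import re

COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)      # /* ... */, possibly multi-line
LEADING_BLANK = re.compile(r"^\s*\n")              # blank lines before an entry
HEADER = re.compile(r"^(\S[^:\n]*):[ \t]*\n")      # unindented "Name:" on its own line
FIELD = re.compile(r"^([^:\n]+):(.*)$", re.DOTALL) # "Name: value", value may span lines
TOP = "(top level)"                                # hypothetical bucket for unindented entries

def parse_sections(text):
    """Return {section name: {field name: value}}."""
    sections = {}
    current = TOP
    for chunk in COMMENT.sub("", text).split(";;"):
        chunk = LEADING_BLANK.sub("", chunk)
        if not chunk.strip():
            continue
        header = HEADER.match(chunk)
        if header and chunk[header.end():].strip():
            # "Section:" followed by its first field in the same entry.
            current = header.group(1)
            chunk = chunk[header.end():]
        elif not chunk[0].isspace():
            # Unindented "Name: value" entry: not part of any section.
            current = TOP
        field = FIELD.match(chunk.strip())
        if field is None:
            raise ValueError(f"no 'name:' found in entry {chunk!r}")
        value = " ".join(field.group(2).split())   # collapse runs of whitespace
        sections.setdefault(current, {})[field.group(1).strip()] = value
    return sections

SAMPLE = """Section1.0:
\tSubSection1.1:  Value1;;
\tSubSection1.2:  Value2;;

Section3.0:
\tSubSection3.1: Usually a two-word-phrase;;
\tSubSection3.2:
\t/* multi-
\tline comment */
\t;;
Section8.0: Integer;;
Section9.0: End-of-Document
"""

sections = parse_sections(SAMPLE)
for name, fields in sections.items():
    print(name, fields)
```

Each inner dict corresponds to one single-row table, which matches the "each Section would be its own tibble" layout asked for.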

Once you have it beaten into shape, the parser package (https://goo.gl/JCvYeh) can turn the data into a data frame, which is just a step away from a tibble. You can then isolate the comments, perhaps with help from tidytext, and drop the unneeded fields with %>% select(-unwanted). You'll also want tidytext for the plain-text fields.
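
If the pre-processing happens in Python, one simple hand-off is to write the parsed rows to CSV and read that into R (for example with readr::read_csv()). The sketch below assumes the rows are plain dicts; `rows_to_csv` is a hypothetical helper, and the two rows are just the Foo/Bar sample values. Listing only the wanted columns plays the role of select(-unwanted).

```python
import csv
import io

def rows_to_csv(rows, columns):
    """Write dict rows as CSV text, keeping only the listed columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",              # missing keys become empty cells
        extrasaction="ignore",   # keys not in `columns` are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = [
    {"ProcedureOfOrigin": "Export", "Name": "Foo", "IsActive": "TRUE"},
    {"ProcedureOfOrigin": "Export", "Name": "Bar", "IsActive": "TRUE"},
]
csv_text = rows_to_csv(rows, ["ProcedureOfOrigin", "Name"])
print(csv_text)
# ProcedureOfOrigin,Name
# Export,Foo
# Export,Bar
```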