Introduction
Apologies if this is already covered ad nauseam, but I haven't been able to find an example matching my needs. Here's an overview of what I'm trying to accomplish:
I have text files (albeit with non-.txt extensions) generated as exports from proprietary software. I would like to parse them into a tibble for management and analysis, then write the tibble back out in the native format so I can upload any changes. These files have a consistent structure, similar to that of JSON/XML/HTML; ideally, they could be harvested/scraped the same way one would scrape a website, but I have a feeling that's too ambitious for my current needs.
RegEx has gotten me only so far, and I have a feeling there's a better, more efficient way to do this. Can anyone help identify a method or strategy? Examples below:
CSV/JSON-Like Document:
The following sample text contains two 'components'; the rest of the document repeats the same structure:
[ProcedureOfOrigin,Export
(CatalogType,
[ComponentReference,Add
(IsActive,TRUE)
(ComponentProperties,
[CatalogDocument,Find
(Name,"Foo")
(ScopeOfFunction,"1")
(DocumentType,"0")
])
(DocumentProperties,
[DocumentSubType,Find
(Description,"Foo Document for Production")
(Name,"Foo Document")
])
])
]
[ProcedureOfOrigin,Export
(CatalogType,
[ComponentReference,Add
(IsActive,TRUE)
(ComponentProperties,
[CatalogDocument,Find
(Name,"Bar")
(ScopeOfFunction,"1")
(DocumentType,"0")
])
(DocumentProperties,
[DocumentSubType,Find
(Description,"Bar Document for Production")
(Name,"Bar Document")
])
])
]
Viewed as a template, I'm looking for the value after almost every comma:
[Variable1,Value1
(SubSection1,
[Variable2,Value2
(Variable3,Value3)
(SubSection2,
[Variable4,Value4
(Variable5,"Value5")
(Variable6,"Value6")
(Variable7,"Value7:")
])
(SubSection3,
[Variable8,Value8
(Variable9,"Value9")
(Variable10,"Value10")
])
])
]
Desired Output for the JSON/CSV-Like Document:
ProcedureOfOrigin | ComponentReference | CatalogDocument | Name | ScopeOfFunction | DocumentType |
---|---|---|---|---|---|
Export | Add | Find | Foo | 1 | 0 |
Export | Add | Find | Bar | 1 | 0 |
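For reference, here is a minimal base R sketch of the kind of approach I have in mind for this format, not a finished solution. It splits the text into top-level records by tracking bracket depth, pulls one key/value pair per line, and keeps the first occurrence of each key within a record (so `Name` is "Foo", not "Foo Document"). The function name `parse_export` and the placeholder-based sample are my own inventions for illustration. It assumes one pair per line, no brackets or parentheses inside quoted values, and it drops empty values.

```r
# Hypothetical helper: one row per top-level [ ... ] record.
parse_export <- function(text) {
  lines <- trimws(unlist(strsplit(text, "\n", fixed = TRUE)))
  lines <- lines[nzchar(lines)]

  # Bracket depth after each line; a record ends when depth returns to 0.
  # Assumption: "[" and "]" never appear inside quoted values.
  count <- function(ch) lengths(regmatches(lines, gregexpr(ch, lines, fixed = TRUE)))
  depth <- cumsum(count("[") - count("]"))
  record <- cumsum(c(TRUE, head(depth, -1) == 0))

  # One key/value pair per line, e.g. (Name,"Foo") or [CatalogDocument,Find.
  # Lines such as "(CatalogType," or "])" carry no value and do not match.
  pat <- '^[\\[(](\\w+),"?([^")]+)"?\\)?$'
  m <- regmatches(lines, regexec(pat, lines, perl = TRUE))
  has_pair <- lengths(m) == 3
  key <- vapply(m[has_pair], `[`, character(1), 2)
  value <- vapply(m[has_pair], `[`, character(1), 3)
  rec <- record[has_pair]

  # Long -> wide; match() returns the first occurrence of each key per record.
  keys <- unique(key)
  rows <- lapply(split(seq_along(key), rec),
                 function(i) value[i][match(keys, key[i])])
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- keys
  out
}

# Sample built from the structure above; NAME is a placeholder I introduced.
template <- '[ProcedureOfOrigin,Export
(CatalogType,
[ComponentReference,Add
(IsActive,TRUE)
(ComponentProperties,
[CatalogDocument,Find
(Name,"NAME")
(ScopeOfFunction,"1")
(DocumentType,"0")
])
(DocumentProperties,
[DocumentSubType,Find
(Description,"NAME Document for Production")
(Name,"NAME Document")
])
])
]'
fill <- function(name) gsub("NAME", name, template, fixed = TRUE)
sample_text <- paste(fill("Foo"), fill("Bar"), sep = "\n")

wanted <- c("ProcedureOfOrigin", "ComponentReference", "CatalogDocument",
            "Name", "ScopeOfFunction", "DocumentType")
result <- parse_export(sample_text)[wanted]
print(result)
```

If the tibble package is available, `tibble::as_tibble(result)` should turn this into a tibble. Writing the data back to the native format would need the nesting preserved, which this flat sketch deliberately throws away.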
XML/HTML-Like Document:
The other exported file has a template like the following:
Section1.0:
SubSection1.1: Value1;;
SubSection1.2: Value2;;
SubSection1.3: Value3;;
SubSection1.4: Value4;;
SubSection1.5: Value5;;
SubSection1.6: Value6;;
SubSection1.7: Value7;;
SubSection1.8: YYYY-MM-DD;;
SubSection1.9: Value9;;
Section2.0:
SubSection2.1: This can be a very large block of text with /* Comments in between */
;;
SubSection2.2: Same thing for this and the rest of the following sections.
;;
SubSection2.3: /****** Comments can sometimes take this form *****/
;;
SubSection2.4: And so-on.
;;
Section3.0:
SubSection3.1: Usually a two-word-phrase;;
SubSection3.2:
/* These comments can be OBNOXIOUS
and be multi-
line
With any characters in them
Less important for me to have in general */
;;
Section4.0: Integer
;;
Section5.0: Block of Text
;;
Section6.0: If-Then Statements + Conclusions.
;;
Section7.0:
;;
Section8.0: Integer;;
Section9.0: End-of-Document
Desired Output for XML-Like Document:
Each Section would be its own tibble (think Normalized Relational Database).
SubSection1.1 | SubSection1.2 | SubSection1.3 | SubSection1.4 |
---|---|---|---|
Value1 | Value2 | Value3 | Value4 |
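Again as a rough base R sketch rather than a definitive answer: the idea is to treat every line that starts with a known key as the start of a field, glue the following lines onto it until the next key, then strip `/* ... */` comments and the trailing `;;`. The function name `parse_sections` is made up, and the key pattern assumes the names literally look like `Section1.0` / `SubSection1.1` as in the template; real files would need the actual key names, and a value line that happens to look like a key would be misread.

```r
# Hypothetical helper: returns a long table of all fields plus one
# single-row data frame per section that has subsections.
parse_sections <- function(text) {
  lines <- unlist(strsplit(text, "\n", fixed = TRUE))

  # Assumption: keys look like "Section1.0:" or "SubSection1.1:" at line start.
  key_pat <- "^\\s*((?:Sub)?Section[0-9.]+):\\s*(.*)$"
  m <- regmatches(lines, regexec(key_pat, lines, perl = TRUE))
  starts <- lengths(m) == 3
  field_id <- cumsum(starts)

  keys <- vapply(m[starts], `[`, character(1), 2)
  body <- lines
  body[starts] <- vapply(m[starts], `[`, character(1), 3)

  # Join each field's first line with its continuation lines.
  keep <- field_id > 0
  values <- unname(vapply(split(body[keep], field_id[keep]),
                          paste, character(1), collapse = "\n"))

  # Drop /* ... */ comments (possibly multi-line), then the ";;" terminator.
  values <- gsub("(?s)/\\*.*?\\*/", "", values, perl = TRUE)
  values <- trimws(sub(";;$", "", trimws(values)))

  # Each field belongs to the most recent "Section" key seen so far.
  is_section <- grepl("^Section", keys)
  section <- c(NA, keys[is_section])[cumsum(is_section) + 1]
  fields <- data.frame(section = section, key = keys, value = values,
                       stringsAsFactors = FALSE)

  subs <- fields[!is_section, ]
  tables <- lapply(split(subs, subs$section), function(s) {
    as.data.frame(as.list(setNames(s$value, s$key)), stringsAsFactors = FALSE)
  })
  list(fields = fields, tables = tables)
}

sample_text <- "Section1.0:
SubSection1.1: Value1;;
SubSection1.2: Value2;;
Section2.0:
SubSection2.1: A large block of text with /* Comments in between */
;;
Section3.0:
SubSection3.2:
/* multi-
line comment */
;;
Section4.0: Integer
;;
Section9.0: End-of-Document"

parsed <- parse_sections(sample_text)
print(parsed$tables[["Section1.0"]])
```

Sections that carry a value directly (like `Section4.0` here) only show up in `parsed$fields`, since they have no subsections to spread into columns.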