Hi folks!
I'm trying to read a dataset into R that originates from some academic software, and the software's output is a pretty idiosyncratic .txt file. It is structured, but not in a structure I've seen before: it's not a csv/tsv, and it's also not JSON.
I'm presuming I'll have to write some custom code to parse it, and I'm looking for recommendations on how to do that, ideally using tidyverse/tidyverse-adjacent packages.
Here is an example of the structure (lightly anonymised from the real data):
TY - JOUR
T1 - "Title of first piece of data"
KW - keyword1
keyword2
keyword3
keyword4
PY - 2018
DA - 2018/09//
Y1 - 2018/09//
AB - Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
VL - 32
IS - 6
SP - 638
CY - United States
SN - 0893-164X (Linking)
U1 - 43261956
U2 - 30211584
N1 -
ER -
TY - JOUR
T1 - "Title of second piece of data"
KW - keyword1
keyword3
keyword_we_havent_seen_before1
PY - 2019
DA - 2019///
Y1 - 2019///
AB - Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur?
VL - 66
IS - 1
SP - 29
CY -
UR - example.com
SN - 0022-0167
U1 - 43260260
N1 - 2019-02-27
ER -
Some notes I've made about the data:
a double newline indicates a new record
a newline followed by a two-character tag (two capitals, or a capital and a digit such as T1) and a hyphen indicates a new field within a record
however, newlines are also used WITHIN fields, as in the KW keyword field
Frustratingly, fields don't repeat in a consistent way when there is missing data - the second record has a field UR for urls, but the first record does not have this field at all. However the first record has a field CY for country, and the second record keeps that field but leaves it empty.
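To make the notes above concrete, here is a rough sketch of how they might translate into tidyverse code. It is only one possible approach under stated assumptions, not a tested solution for the real file: the function name `parse_tagged`, the regular expression, and the "; " separator for multi-line fields are all made up for illustration. It also assumes every record starts with a TY line and ends with an ER line (as in the sample above), and uses that rather than blank lines to find record boundaries. The native pipe needs R 4.1 or later.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hypothetical helper: turn a character vector of lines into one row per record.
parse_tagged <- function(lines) {
  # Assumed tag shape: a capital, then a capital or digit, then spaces and a hyphen.
  tag_pattern <- "^([A-Z][A-Z0-9])\\s+-\\s?(.*)$"

  tibble(line = lines) |>
    filter(str_trim(line) != "") |>                      # drop blank separator lines
    mutate(
      tag   = str_match(line, tag_pattern)[, 2],         # NA for continuation lines
      value = if_else(is.na(tag),
                      str_trim(line),                    # continuation: whole line is the value
                      str_trim(str_match(line, tag_pattern)[, 3])),
      record = cumsum(!is.na(tag) & tag == "TY")         # each TY line opens a new record
    ) |>
    fill(tag) |>                                         # continuation lines inherit the previous tag
    filter(tag != "ER") |>                               # the end-of-record marker carries no data
    group_by(record, tag) |>
    summarise(value = paste(value[value != ""], collapse = "; "),
              .groups = "drop") |>
    mutate(value = na_if(value, "")) |>                  # present-but-empty fields become NA
    pivot_wider(names_from = tag, values_from = value)   # fields absent from a record also become NA
}

# Cut-down version of the sample above.
example_lines <- c(
  "TY - JOUR",
  "T1 - \"Title of first piece of data\"",
  "KW - keyword1",
  "keyword2",
  "CY - United States",
  "ER -",
  "",
  "TY - JOUR",
  "T1 - \"Title of second piece of data\"",
  "KW - keyword1",
  "keyword3",
  "CY -",
  "UR - example.com",
  "ER -"
)

records <- parse_tagged(example_lines)
# For a real file, something like parse_tagged(readLines("export.txt")) would be
# the equivalent call ("export.txt" is a placeholder file name).
```

Both kinds of missing data end up as NA here: the empty CY in the second record and the absent UR in the first. One known weakness of the sketch is that a continuation line that happens to look like a tag (two capitals, a space, and a hyphen) would be misread as a new field.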
Note: I'm a pretty experienced R programmer but this kind of data is new to me - if folks want to give detailed answers they'll certainly be appreciated, but also if you'd prefer to just quickly point me at packages/functions that you'd use on a job like this I'm happy to then dig into reading that documentation on my own.