Best practices/packages for processing weirdly structured .txt data file?

Hi folks!

I'm trying to read a dataset into R that originates from some academic software, and the software's output is a pretty idiosyncratic .txt file. It is structured, but not in a structure I've seen before: it's not a CSV/TSV, and it's not JSON either.

I'm presuming I'll have to write some custom code to read it in, and I'm looking for recommendations on how to do that, ideally using tidyverse/tidyverse-adjacent packages.

Here is an example of the structure (lightly anonymised from the real data):


TY  - JOUR
T1  - "Title of first piece of data"
KW  - keyword1
keyword2
keyword3
keyword4
PY  - 2018
DA  - 2018/09//
Y1  - 2018/09//
AB  - Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
VL  - 32
IS  - 6
SP  - 638
CY  - United States
SN  - 0893-164X (Linking)
U1  - 43261956
U2  - 30211584
N1  -
ER  -

TY  - JOUR
T1  - "Title of second piece of data"
KW  - keyword1
keyword3
keyword_we_havent_seen_before1
PY  - 2019
DA  - 2019///
Y1  - 2019///
AB  - Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur?
VL  - 66
IS  - 1
SP  - 29
CY  -
UR  - example.com
SN  - 0022-0167
U1  - 43260260
N1  - 2019-02-27
ER  -

Some notes I've made about the data:

A blank line (double newline) indicates a new record.

A line starting with a two-character tag (capital letters or digits, e.g. TY, T1, KW) followed by two spaces and a hyphen indicates a new field within a record.

However, newlines are also used WITHIN fields, as in the KW keyword field.

Frustratingly, fields aren't handled consistently when data is missing: the second record has a UR field for URLs, but the first record doesn't have that field at all. Meanwhile, the first record has a CY field for country, and the second record keeps that field but leaves it empty.

Note: I'm a pretty experienced R programmer, but this kind of data is new to me. Detailed answers will certainly be appreciated, but if you'd prefer to just quickly point me at the packages/functions you'd use on a job like this, I'm happy to dig into the documentation on my own.
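
For reference, here's the rough shape of the hand-rolled parse I've been sketching with readr/stringr/tidyr. It's only a sketch that assumes the structure described in my notes above, and the file name "records.txt" is a placeholder:

library(tidyverse)

# Read the raw lines; "records.txt" is a placeholder file name
raw <- read_lines("records.txt")

parsed <- tibble(line = raw) %>%
  filter(line != "") %>%                                  # drop the blank separator lines
  mutate(
    record = cumsum(str_detect(line, "^TY  -")),          # every record opens with a TY tag
    tag    = str_extract(line, "^[A-Z][A-Z0-9](?=  -)"),  # two-character tag; NA on continuation lines
    value  = if_else(
      is.na(tag),
      str_trim(line),                                     # continuation line, e.g. an extra keyword
      str_trim(str_remove(line, "^[A-Z][A-Z0-9]  -"))
    )
  ) %>%
  fill(tag) %>%                                           # continuation lines inherit the previous tag
  filter(value != "")                                     # discard empty fields such as "N1  -" / "ER  -"

# One row per record; multi-value fields like KW get collapsed into one string
wide <- parsed %>%
  pivot_wider(
    id_cols     = record,
    names_from  = tag,
    values_from = value,
    values_fn   = function(x) paste(x, collapse = "; ")
  )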

This is not a complete answer, but it might help get you a little further: I think your data may be in RIS format.

You might also be interested in the bibliometrix package (which I've never used but am aware of; there may be other similar options, too!), which is made for analyzing this type of data, though it seems to need the data converted to BibTeX format first. I don't specifically know of an R-based RIS-to-BibTeX converter, but there might be one out there, and a web search will turn up several other ways to accomplish that step.
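
If it is RIS, there are also packages that read RIS files directly into a data frame; revtools is one I've seen mentioned (I haven't used it myself, and the file name below is just a placeholder):

# revtools::read_bibliography() imports common bibliography formats, including RIS;
# "records.ris" is a placeholder file name
library(revtools)
refs <- read_bibliography("records.ris")
str(refs)   # roughly one row per record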

Hey! That's true, thanks!

It still seems to be a confusing implementation of RIS; for example, it uses a blank line between records, which breaks this convention:

"Multiple citation records can be present in a single RIS file. A record ends with an "end record" tag ER - with no additional blank lines between records." (from the wikipedia)

Weird, but I feel like I'm on the right track now! Thanks for your advice :)

