Converting undelimited text file into data frame / Converting character vector with missing data (different lengths) into data frame

Zampano · January 28, 2023, 7:18pm

I am trying to read in a .txt file and parse part of it, which is a table, into a data frame. So far I've read the .txt file into r using read_lines and selecting the rows of the file I want to work with. This leaves me with a character vector with a length of over 500.

I can convert this into a table using either:
data.frame(str_split_fixed(myStringName, " +", 21)) which separates by white space into 21 columns

k <- read.table(text=myStringName, header = TRUE, fill = TRUE)

The issue is that there are blank spaces in some "cells" of the table where data is not reported. So in instances where not every "cell" has data in a row that data gets shifted to the left and the output looks something like this (I've hidden the first 11 columns which all have data). Here, only rows 4 and 5 are correct, the others get shifted to the left depending on how much is missing.

       1  2      3     4        5        6 7        8        9 10
1: -0.71  4 0.5158 0.000    1.000        2 1 CLIM1_sc        |   
2:  0.18  2 0.8757 1.000        1 CLIM1_sc |                     
3: -0.33  0 0.0000 1.000 CLIM1_sc        |                       
4: -0.14 45 0.8881 0.157    .6919      .54 9        2 CLIM2_sc  |
5: -0.07  3 0.9469 0.500    .4795      .00 2        2 CLIM2_sc  |
6: -0.26  2 0.8220 1.000        2 CLIM2_sc |

How can I read this data into a table so that it outputs correctly? I would like the output to look something like this:

       1  2      3     4        5        6 7        8        9 10
1: -0.71  4 0.5158 0.000    1.000        2 1          CLIM1_sc  |   
2:  0.18  2 0.8757 1.000                 1            CLIM1_sc  |                     
3: -0.33  0 0.0000 1.000                              CLIM1_sc  |                       
4: -0.14 45 0.8881 0.157    .6919      .54 9        2 CLIM2_sc  |
5: -0.07  3 0.9469 0.500    .4795      .00 2        2 CLIM2_sc  |
6: -0.26  2 0.8220 1.000                   2          CLIM2_sc  |

I could use a fixed width for this one file, but I would like to do this for multiple files and some of the lengths of the first few columns are different.

TalhaAsif · January 28, 2023, 9:02pm

Utilizing the read.table() function with the sep argument set to a regular expression that matches multiple whitespaces is one approach to this issue. The function will not shift the data to the left because it will treat multiple whitespaces as a single delimiter. The na.strings argument can then be used to specify which strings, like "|," should be considered missing values. Here is an illustration of how you can incorporate these arguments into your data:

k <- read.table(text=myStringName, header = TRUE, fill = TRUE, sep = "\\s+", na.strings = "|")

Zampano · January 28, 2023, 10:14pm

Thanks for the reply. When I try that with data.table I get the following output:

Error in scan(file, what = "", sep = sep, quote = quote, nlines = 1, quiet = TRUE,  :   invalid 'sep' value: must be one byte

I've been able to use regex in str_split and using "\\s+" works, but it results in the same output with data after blank spaces being pushed to the left:

splitString <- str_split(myStringName, "\\s+")

df <- do.call(rbind.data.frame, splitString)

system · March 11, 2023, 10:15pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.