How to read a text file of approximately 40,000 lines in R?

Shri1506 · March 5, 2020, 12:39pm

This question might look really simple, but I still have some doubts after trying several ways to read this data.

First - The file is in text format and has approximately 40,000 lines, and I want to read all of the data initially.

Second - I need the values of dF, fRPMmean and szSystemID from the data I read, for some calculations later.

Third - I want to use the data between the lines "[specdata0]" and "#--finish--" for my analysis.

My data in the text file looks like this:

Continuation of my data (I copied sample data):
6.0243638E-5
1.19034885E-4
1.3678148E-4
1.09321154E-4
2.7332282E-5
2.2741018E-5
3.159504E-5
8.073375E-5
3.746524E-5
5.031867E-5
2.6451544E-5
4.029416E-5
4.7287827E-5
3.806267E-5
2.3926008E-5
2.3086282E-5
3.781592E-5
5.438561E-5
5.82364E-5
1.0197797E-4
1.5383417E-5
2.9165532E-5
4.7294132E-5
6.0461047E-5
3.9730767E-5
#--finish--

I need data between [specdata0] and #--finish-- for my analysis.

Thank You

pieterjanvc · March 5, 2020, 1:05pm

Hi,

Here is one way of doing this:

I created a dummy file called test.txt:

#ignore
#ignore
#ignore

[ignore]
ignore
ignore
ignore
ignore

[specdata0]
4.7287827E-5
3.806267E-5
2.3926008E-5
2.3086282E-5
3.781592E-5
5.438561E-5
5.82364E-5
1.0197797E-4
1.5383417E-5
2.9165532E-5
4.7294132E-5
6.0461047E-5
3.9730767E-5
#--finish--

ignore
ignore

Now let's extract the data:

myData = readLines("test.txt")
myData = myData[(which(myData == "[specdata0]")+1):
                  (which(myData == "#--finish--")-1)]
myData = as.numeric(myData)

myData
 [1] 4.728783e-05 3.806267e-05 2.392601e-05 2.308628e-05
 [5] 3.781592e-05 5.438561e-05 5.823640e-05 1.019780e-04
 [9] 1.538342e-05 2.916553e-05 4.729413e-05 6.046105e-05
[13] 3.973077e-05

This is the simplest implementation and will only work if the file contains exactly one [specdata0] and one #--finish--. It can be adapted if there are more in the file and one or all of them are needed.
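As a rough sketch of that adaptation, the version below pairs every [specdata0] line with the first #--finish-- line that follows it and returns one numeric vector per block. The in-memory vector is a made-up stand-in for readLines("test.txt"), and the names starts, ends and blocks are only illustrative.

```r
# Made-up stand-in for myData = readLines("test.txt"), with two data blocks
myData <- c("#ignore", "[specdata0]", "1.5E-5", "2.5E-5", "#--finish--",
            "ignore", "[specdata0]", "3.5E-5", "#--finish--")

starts <- which(myData == "[specdata0]")
ends   <- which(myData == "#--finish--")

# Pair each start marker with the first end marker after it
blocks <- lapply(starts, function(s) {
  e <- ends[ends > s][1]
  if (is.na(e)) stop("no #--finish-- found after line ", s)
  # Lines strictly between the two markers (empty if they are adjacent)
  as.numeric(myData[s + seq_len(e - s - 1)])
})
```

blocks[[1]] then holds the first block, and unlist(blocks) would combine all blocks into one vector if that is what is needed.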

Hope this helps,
PJ

Shri1506 · March 5, 2020, 1:20pm

Hi Pj,

Your code definitely works,
but I also need the lines containing the values of dF, fRPMmean and szSystemID,
as I have a formula to determine the number of data values to be used for the analysis:

(fRPMmean * 4) / dF

This gives me the number of data elements I need to consider.

Thank You,

pieterjanvc · March 6, 2020, 2:03am

Hi,

I extended the code to generalize it and create a list that contains all the data in the format you presented. It should work as long as the structure of your file does not deviate too much from your example.

The file

#Comment
#Comment
#Comment

[specchannel10]
iVersion=TRUE
fFmax=test
fRPMean=18.46

[specdata0]
4.7287827E-5
3.806267E-5
2.3926008E-5
2.3086282E-5
3.781592E-5
5.438561E-5
5.82364E-5
1.0197797E-4
1.5383417E-5
2.9165532E-5
4.7294132E-5
6.0461047E-5
3.9730767E-5
#--finish--

The processing

library(stringr)
library(readr)

myFile = readLines("test.txt")

myResult = list()
#Find the position of the variables that are between []
#We add one past the last line number as an end position, so the last line is kept
vars = c(which(str_detect(myFile, "^\\[.*\\]\\s*$")), length(myFile) + 1)

#Get the content for each variable
for(i in seq_len(length(vars) - 1)){
  myData = myFile[vars[i]:(vars[i + 1] - 1)]
  #Remove lines that are comments or blank
  myData = myData[!str_detect(myData, "^\\s*#|^\\s*$")]
  #Skip sections that have a header but no content
  if(length(myData) < 2) next

  #If the content is a list of variables, create them as a list
  if(str_detect(myData[2], "=")){
    #Split on the first "=" only, so values may contain "="
    content = str_split(myData[-1], "=", n = 2)
    result = lapply(lapply(content, "[", 2), parse_guess)
    names(result) = sapply(content, "[", 1)
  } else {
    #If the content is just a vector of data, extract it
    result = parse_guess(myData[-1])
  }

  #Create the variable as a list item and assign the content
  myResult[[str_remove_all(myData[1], "\\[|\\]")]] = result
}

The result

> myResult
$specchannel10
$specchannel10$iVersion
[1] TRUE

$specchannel10$fFmax
[1] "test"

$specchannel10$fRPMean
[1] 18.46


$specdata0
 [1] 4.728783e-05 3.806267e-05 2.392601e-05 2.308628e-05 3.781592e-05 5.438561e-05 5.823640e-05
 [8] 1.019780e-04 1.538342e-05 2.916553e-05 4.729413e-05 6.046105e-05 3.973077e-05

This code generates a list in which every variable between square brackets is a sublist and the content of each sublist is either a list of variables or a vector of data.
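To make that structure concrete, here is a small sketch of how the pieces can be reached. The list is built by hand to mirror the printed result (shortened to three data values), so it is only a stand-in for the real myResult; rpm, values and is_header are illustrative names.

```r
# Hand-built stand-in that mirrors the printed structure of myResult
myResult <- list(
  specchannel10 = list(iVersion = TRUE, fFmax = "test", fRPMean = 18.46),
  specdata0     = c(4.7287827e-5, 3.806267e-5, 2.3926008e-5)
)

# Section names, i.e. the text that was between square brackets
names(myResult)

# A single header value from a section of key=value pairs
rpm <- myResult$specchannel10$fRPMean

# The numeric vector from a data section
values <- myResult[["specdata0"]]

# Which sections are key=value lists rather than plain data vectors
is_header <- sapply(myResult, is.list)
```

The [[ ]] form is handy when the section name is stored in a variable, for example myResult[[sectionName]].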

Hope this helps,
PJ

Shri1506 · March 12, 2020, 10:46am

Hey Pj,

Thanks for your code, it creates the sub-lists. But I need to create some kind of loop that extracts the values of dF and fRPMmean, so I can calculate the number of data values (mainly the values between [specdata0] and #--finish--) to be considered for the analysis. The values of dF and fRPMmean occur just once in this file.

Formula to calculate the number of data values to be considered for the analysis = (fRPMmean * 4) / dF.

Thank You

pieterjanvc · March 12, 2020, 11:02am

Hi,

I don't know what else you need here, because my result gives you exactly that...
If you need the value of dF, you access it with myResult$specchannel10$dF; the same goes for fRPMean, which is myResult$specchannel10$fRPMean. So in your formula that would be:

myResult$specchannel10$fRPMean * 4 / myResult$specchannel10$dF

#OR

fRPMean = myResult$specchannel10$fRPMean
dF = myResult$specchannel10$dF

fRPMean * 4 / dF
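A minimal sketch of using that number to pick the first values of specdata0 could look like the following. The dF value here is made up (the sample file above does not contain one), the list is a hand-built stand-in for the parsed result, and rounding down with floor() is an assumption about how a fractional count should be handled.

```r
# Hand-built stand-in for the parsed list; dF = 12.5 is a made-up value
myResult <- list(
  specchannel10 = list(fRPMean = 18.46, dF = 12.5),
  specdata0 = c(4.7287827e-5, 3.806267e-5, 2.3926008e-5, 2.3086282e-5,
                3.781592e-5, 5.438561e-5, 5.82364e-5, 1.0197797e-4)
)

hdr <- myResult$specchannel10
stopifnot(!is.null(hdr$fRPMean), !is.null(hdr$dF))

# Number of values to keep: 18.46 * 4 / 12.5 = 5.9072, rounded down to 5
n <- floor(hdr$fRPMean * 4 / hdr$dF)

# Never ask for more values than the data section contains
n <- min(n, length(myResult$specdata0))

# First n values, in their original order
selected <- head(myResult$specdata0, n)
```

If the count should be rounded to the nearest integer instead, round() can replace floor().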

Does this help?
PJ

Shri1506 · March 12, 2020, 12:08pm

Thank You very much Pj.

Shri1506 · March 13, 2020, 8:58am

Hey Pj,

After this step I subset the data:

myFile = myFile[1:(myResult$specchannel10$fRPMean * 4 / myResult$specchannel10$dF)]

Now I want to use the split function to split the data into 3 groups of 1000 values each, with the remaining data going into a 4th group. This grouping should take place without sorting the data.

How can I use the split function for this, or is there another function I can use?

pieterjanvc · March 13, 2020, 11:00am

Hello,

We suggest you open a new topic when your initial question has been answered and you have a new one, since the title only refers to the first question and things can become messy afterwards.

Since this question is relatively easy and was recently posted by another member of this forum, I'll just forward the topic link:
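Independently of that topic, a minimal sketch of such a grouping with base R's split() might look like this. The vector x is made-up data standing in for the selected values, and group_size and n_full are illustrative names. The group index is derived from each element's position, so nothing gets sorted.

```r
# Made-up data standing in for the selected values
x <- seq_len(3500) / 1000

group_size <- 1000  # values per full group
n_full     <- 3     # number of full groups before the remainder group

# Positions 1-1000 get index 1, 1001-2000 get 2, 2001-3000 get 3,
# and every later position is capped at index 4
grp <- pmin((seq_along(x) - 1) %/% group_size + 1, n_full + 1)

# Named list with elements "1" to "4"; order within each group is preserved
groups <- split(x, grp)
```

lengths(groups) shows how many values ended up in each group.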

Hope this helps,
PJ
