String splitting of file paths for an entire data set column (not a single path given by the user)

Nidha · April 9, 2019, 11:32am

Incoming/Amesh_CV_v1.docx
Incoming/Amesh_CV_v2.docx
Incoming/Amesh_CV_v3.docx
Incoming/Amesh_CV_v4.docx
Incoming/Amesh_CV_v6.docx
Incoming/Amesh_Q_v1.docx
Incoming/MIT/Akash_MIT_SoP_v1.docx
Incoming/MIT/Akash_MIT_SoP_v2.docx

The data above is in a data.frame, and it comes from the "Amesh's" folder on the E: drive.

So, given the data.frame, with something like

strsplit('Incoming/Amesh_CV_v1.docx', '')

I want the output to have Path and Version in separate columns.

The split should be based on the version, i.e. v1, _v1, V1 and so on.

I tried text processing using tidytext, stringr, regmatches with gregexpr, and str_extract with a regex (regular expression), but I am not able to get any output.

Kindly let me know; I have been trying this for two days.

mara · April 9, 2019, 12:02pm

An imperfect, but somewhat workable solution…

I'm using the fill argument of separate() to get all of the files into the same column, despite the fact that the files are at different depths (you could probably extract file names using a regular expression, too, but I'm not that good at them).

Then, for the versions, I'm using a regular expression with str_extract() saying "give me lower-case v followed by numbers" (see more on stringr and regular expressions here).

suppressPackageStartupMessages(library(tidyverse))
library(stringr)

files <- c("Incoming/Amesh_CV_v1.docx",
  "Incoming/Amesh_CV_v2.docx",
  "Incoming/Amesh_CV_v3.docx",
  "Incoming/Amesh_CV_v4.docx",
  "Incoming/Amesh_CV_v6.docx",
  "Incoming/Amesh_Q_v1.docx",
  "Incoming/MIT/Akash_MIT_SoP_v1.docx",
  "Incoming/MIT/Akash_MIT_SoP_v2.docx")

tibble(files) %>%
  separate(files, into = c("lv1", "lv2", "lv3"), sep = "/", fill = "left") %>%
  mutate("version" = str_extract(lv3, regex("v\\d+")))
#> # A tibble: 8 x 4
#>   lv1      lv2      lv3                   version
#>   <chr>    <chr>    <chr>                 <chr>  
#> 1 <NA>     Incoming Amesh_CV_v1.docx      v1     
#> 2 <NA>     Incoming Amesh_CV_v2.docx      v2     
#> 3 <NA>     Incoming Amesh_CV_v3.docx      v3     
#> 4 <NA>     Incoming Amesh_CV_v4.docx      v4     
#> 5 <NA>     Incoming Amesh_CV_v6.docx      v6     
#> 6 <NA>     Incoming Amesh_Q_v1.docx       v1     
#> 7 Incoming MIT      Akash_MIT_SoP_v1.docx v1     
#> 8 Incoming MIT      Akash_MIT_SoP_v2.docx v2

Created on 2019-04-09 by the reprex package (v0.2.1)

Aside: If you do have access to the actual file paths, the fs package has a lot of nice helpers (e.g. is_dir(), is_file()) that could be handy.
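
For the "v1, _v1, V1" variants mentioned in the question, a base R sketch along these lines might work. It is a different technique from the separate() call above: basename() and dirname() split the path at any depth, and a case-insensitive pattern picks up the version. It assumes the version sits directly before the file extension; the upper-case and unversioned paths are made up for illustration.

```r
# Base R sketch: split each path into folder and file name, then pull the version.
files <- c("Incoming/Amesh_CV_v1.docx",
           "Incoming/Amesh_CV_V2.docx",          # upper-case V (hypothetical variant)
           "Incoming/MIT/Akash_MIT_SoP_v2.docx",
           "Incoming/notes.docx")                # no version at all (hypothetical)

file_name <- basename(files)   # text after the last "/"
folder    <- dirname(files)    # everything before it, at any depth

# "v" or "V" followed by digits, directly before the extension; keep the digits.
pattern <- "^.*[vV]([0-9]+)\\.[^.]+$"
version <- ifelse(grepl(pattern, file_name),
                  sub(pattern, "\\1", file_name),
                  NA_character_)

result <- data.frame(folder, file_name, version = as.integer(version),
                     stringsAsFactors = FALSE)
print(result)
```

Files without a recognisable version get NA rather than an error, which makes them easy to filter out later.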

Nidha · April 9, 2019, 12:11pm

Thank you, ma'am, but if I have to take this from a dataset or a data frame, I get the error "non-character argument" in R.

What should I do about this?
Kindly reply.

Nidha · April 9, 2019, 12:11pm

There are 3000 file paths, so I can't type them all into R. How do I load the entire data set?

mara · April 9, 2019, 12:16pm

I can't tell without the exact error message and a sample of the data, but I'm guessing that you don't have stringsAsFactors = FALSE set and are therefore trying to do string operations on factors. You can fix this by converting the column in the data frame to character with as.character().

https://stat.ethz.ch/R-manual/R-devel/library/base/html/character.html
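
To illustrate that guess, here is a small sketch with a made-up two-row data frame. stringsAsFactors = TRUE is set explicitly so the behaviour is the same on any R version (it was the default before R 4.0.0).

```r
# Hypothetical data frame whose path column is a factor.
df <- data.frame(path = c("Incoming/Amesh_CV_v1.docx",
                          "Incoming/MIT/Akash_MIT_SoP_v2.docx"),
                 stringsAsFactors = TRUE)

print(class(df$path))   # the column is a factor, not character

# strsplit() refuses factors; capture the error message instead of stopping.
msg <- tryCatch(strsplit(df$path, "/"),
                error = function(e) conditionMessage(e))
print(msg)

# Convert the column to character, after which string functions work.
df$path <- as.character(df$path)
parts <- strsplit(df$path, "/")
print(parts)
```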

mara · April 9, 2019, 12:18pm

This is a separate question, and the answer depends on the file format. There's readLines() in base R, as well as read.csv(), etc. — both of these things can also be done with other packages, such as readr and data.table.
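
As a rough sketch of both base R options, assuming the paths are stored one per line in a text file. The temporary file and its contents are stand-ins, since the real file format isn't known.

```r
# A temporary file stands in for the real list of paths (hypothetical data).
path_file <- tempfile(fileext = ".txt")
writeLines(c("Incoming/Amesh_CV_v1.docx",
             "Incoming/Amesh_CV_v2.docx",
             "Incoming/MIT/Akash_MIT_SoP_v1.docx"),
           path_file)

# Option 1: one path per line, read straight into a character vector.
files <- readLines(path_file)

# Option 2: read as a one-column data frame, keeping the column as character.
files_df <- read.csv(path_file, header = FALSE, col.names = "path",
                     stringsAsFactors = FALSE)

str(files)
str(files_df)
```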

Nidha · April 9, 2019, 12:39pm

I am getting an error: non-character argument in R.
There are versions here, right?
#> # A tibble: 8 x 4
#>   lv1      lv2      lv3                   version
#>   <chr>    <chr>    <chr>                 <chr>
#> 1 <NA>     Incoming Amesh_CV_v1.docx      v1
#> 2 <NA>     Incoming Amesh_CV_v2.docx      v2
#> 3 <NA>     Incoming Amesh_CV_v3.docx      v3
#> 4 <NA>     Incoming Amesh_CV_v4.docx      v4
#> 5 <NA>     Incoming Amesh_CV_v6.docx      v6
#> 6 <NA>     Incoming Amesh_Q_v1.docx       v1
#> 7 Incoming MIT      Akash_MIT_SoP_v1.docx v1
#> 8 Incoming MIT      Akash_MIT_SoP_v2.docx v2

i.e. v1, v2, v3, v4, v6.
v5 is missing there, so how do I extract the missing version?
I want the output to be like: Version V5 is missing

I will try what you said
Thank you ma'am
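
A minimal base R sketch of the missing-version check described above. It assumes every file name ends in _v<digits>.<extension> and that versions are expected to run from 1 up to the highest one seen for each document; neither assumption comes from the thread.

```r
files <- c("Incoming/Amesh_CV_v1.docx",
           "Incoming/Amesh_CV_v2.docx",
           "Incoming/Amesh_CV_v3.docx",
           "Incoming/Amesh_CV_v4.docx",
           "Incoming/Amesh_CV_v6.docx",
           "Incoming/Amesh_Q_v1.docx",
           "Incoming/MIT/Akash_MIT_SoP_v1.docx",
           "Incoming/MIT/Akash_MIT_SoP_v2.docx")

file_name <- basename(files)

# Version number: the digits after the final "_v", before the extension.
version <- as.integer(sub("^.*_v([0-9]+)\\.[^.]+$", "\\1", file_name))

# Document name: the file name with "_v<digits>.<ext>" removed.
document <- sub("_v[0-9]+\\.[^.]+$", "", file_name)

# For each document, find the versions absent between 1 and the highest seen.
missing <- lapply(split(version, document),
                  function(v) setdiff(seq_len(max(v)), v))

for (doc in names(missing)) {
  for (m in missing[[doc]]) {
    cat(doc, ": version v", m, " is missing\n", sep = "")
  }
}
```

With the eight sample paths, the only gap is v5 of Amesh_CV.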

Nidha · April 9, 2019, 1:15pm

files <- data.frame("E:/Review/Angshuman_Baruah", stringsAsFactors = FALSE)
tibble(files) %>%
  separate(files, into = c("lv1", "lv2", "lv3"), sep = "/", fill = "left") %>%
  mutate("version" = str_extract(lv3, regex("v\\d+")))

I am getting this:

# A tibble: 1 x 4
  lv1   lv2    lv3              version
1 E:    Review Angshuman_Baruah NA

If I use the dir command instead, it does not work:

files <- data.frame(dir("E:/Review/Angshuman_Baruah", stringsAsFactors = FALSE))

Error in dir("E:/Review/Angshuman_Baruah", stringsAsFactors = FALSE) :
unused argument (stringsAsFactors = FALSE)

Nidha · April 9, 2019, 1:20pm

Kindly reply Ma'am. For the files which are there in the directory

files <- data.frame(dir("E:/Review/Angshuman_Baruah", stringsAsFactors = FALSE))
Error in dir("E:/Review/Angshuman_Baruah", stringsAsFactors = FALSE) :
unused argument (stringsAsFactors = FALSE)

files <- data.frame("E:/Review/Angshuman_Baruah", stringsAsFactors = FALSE)

# A tibble: 1 x 4
  lv1   lv2    lv3              version
1 E:    Review Angshuman_Baruah NA

I want this for the output of

dir("E:/Review/Angshuman_Baruah", pattern=NULL, all.files=FALSE,
full.names=FALSE, recursive = TRUE)

which has 49 entries. How do I get output like this for those files, ma'am?

# A tibble: 8 x 4
  lv1      lv2      lv3                   version
1 NA       Incoming Amesh_CV_v1.docx      1
2 NA       Incoming Amesh_CV_v2.docx      2
3 NA       Incoming Amesh_CV_v3.docx      3
4 NA       Incoming Amesh_CV_v4.docx      4
5 NA       Incoming Amesh_CV_v6.docx      6
6 NA       Incoming Amesh_Q_v1.docx       1
7 Incoming MIT      Akash_MIT_SoP_v1.docx 1
8 Incoming MIT      Akash_MIT_SoP_v2.docx 2

mara · April 9, 2019, 1:30pm

Do you have a list of files as a dataset, or are you trying to get a list of the files that are in a directory?

If the latter, you can use fs::dir_ls() (after installing fs, of course).
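
If installing fs isn't an option, base R's list.files() (dir() is an alias for it) does a similar job. The sketch below builds a throwaway folder to stand in for the real directory, so the folder and file names are hypothetical. Note that stringsAsFactors is an argument of data.frame(), not of dir(), which is what the "unused argument" error above is complaining about.

```r
# A temporary folder stands in for the real directory (hypothetical layout).
root <- file.path(tempdir(), "review_demo")
dir.create(file.path(root, "Incoming", "MIT"),
           recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(root, "Incoming",
                                c("Amesh_CV_v1.docx", "Amesh_CV_v2.docx"))))
invisible(file.create(file.path(root, "Incoming", "MIT",
                                "Akash_MIT_SoP_v1.docx")))

# stringsAsFactors goes inside data.frame(), outside the list.files() call.
details <- data.frame(path = list.files(root, recursive = TRUE),
                      stringsAsFactors = FALSE)

str(details)
```

Naming the column (path here) also makes it easier to refer to in later steps.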

If you're still having trouble, a self-contained reprex (short for reproducible example) will help us help you.

install.packages("reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

There's also a nice FAQ on how to do a minimal reprex for beginners, below:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs


What to do if you run into clipboard problems

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ.

Nidha · April 9, 2019, 1:43pm

dir("E:/Review/Nidha Khan", pattern=NULL, all.files=FALSE,
full.names=TRUE, recursive = TRUE)

details<-data.frame(dir("E:/Review/Nidha Khan", pattern=NULL, all.files=FALSE,
full.names=FALSE, recursive = TRUE))


Using the code above, I get the files inside the folder (folders inside folders, then files) by giving recursive = TRUE.
If I run the first command, it displays all the files.
If I run the second command, it puts all the files in a data frame, but when I try to use that data frame, stringsAsFactors does not work because of the dir command.

I want output like this:

From the folders inside the folder, I want to access all the files and display the versions for each file name,
i.e. v1, v2, v4, v5, v6, ...
If v3 is missing, I want the output to be
version v3 missing

This is the goal.

Ma'am could you please help.

Nidha · April 9, 2019, 1:47pm

The command you sent earlier works for the typed-in data, but it does not work for the data.frame or for the files inside the directory,
i.e. E:/Review/Nidha Khan. Inside this there are folders like "Nidha Khan/MSc", and inside those there are docx, xlsx and png files and some more folders (with files inside them).

Nidha · April 9, 2019, 1:48pm

I am trying to get the list of files in the directory by using

dir("E:/Review/Nidha_Khan", pattern=NULL, all.files=FALSE,
full.names=FALSE, recursive = TRUE)

details<-data.frame(dir("E:/Review/Nidha_Khan", pattern=NULL, all.files=FALSE,
full.names=FALSE, recursive = TRUE))

mara · April 9, 2019, 2:02pm

It's not clear to me what you're working with yet. I can see what you're trying to do, but not what you're getting back.

For splitting a character variable into an unknown number of columns, see the approaches mentioned in the thread below:

Nidha · April 9, 2019, 2:06pm

No, it's just the folders inside directories that I have to extract.
dir("E:/Review/Nidha_Khan", pattern=NULL, all.files=FALSE,
full.names=TRUE, recursive = TRUE)

This is the command.

The same command wrapped in a data.frame:

details<-data.frame(dir("E:/Review/Angshuman_Baruah", pattern=NULL, all.files=FALSE,
full.names=FALSE, recursive = TRUE))

files <- c("Incoming/Amesh_CV_v1.docx",
"Incoming/Amesh_CV_v2.docx",
"Incoming/Amesh_CV_v3.docx",
"Incoming/Amesh_CV_v4.docx",
"Incoming/Amesh_CV_v6.docx",
"Incoming/Amesh_Q_v1.docx",
"Incoming/MIT/Akash_MIT_SoP_v1.docx",
"Incoming/MIT/Akash_MIT_SoP_v2.docx")

tibble(files) %>%
  separate(files, into = c("lv1", "lv2", "lv3"), sep = "/", fill = "left") %>%
  mutate("version" = str_extract(lv3, regex("v\\d+")))
#> # A tibble: 8 x 4
#>   lv1      lv2      lv3                   version
#>   <chr>    <chr>    <chr>                 <chr>
#> 1 <NA>     Incoming Amesh_CV_v1.docx      v1
#> 2 <NA>     Incoming Amesh_CV_v2.docx      v2
#> 3 <NA>     Incoming Amesh_CV_v3.docx      v3
#> 4 <NA>     Incoming Amesh_CV_v4.docx      v4
#> 5 <NA>     Incoming Amesh_CV_v6.docx      v6
#> 6 <NA>     Incoming Amesh_Q_v1.docx       v1
#> 7 Incoming MIT      Akash_MIT_SoP_v1.docx v1
#> 8 Incoming MIT      Akash_MIT_SoP_v2.docx v2

How do I use the command above on this data?
details<-data.frame(dir("E:/Review/Angshuman_Baruah", pattern=NULL, all.files=FALSE,
full.names=FALSE, recursive = TRUE))

Nidha · April 9, 2019, 2:22pm

details<-data.frame(dir("E:/Review/Nidha", pattern=NULL, all.files=FALSE,
full.names=FALSE, recursive = TRUE))

files <- data.frame(details, stringsAsFactors = FALSE)
View(files)

tibble(files) %>%
  separate(files, into = c("lv1", "lv2", "lv3"), sep = "/", fill = "left") %>%
  mutate("version" = str_extract(lv3, regex("v\\d+")))

Output:

# A tibble: 49 x 4
   lv1   lv2   lv3   version
 1 NA    NA    1:49  NA
 2 NA    NA    1:49  NA
 3 NA    NA    1:49  NA
 4 NA    NA    1:49  NA
 5 NA    NA    1:49  NA
 6 NA    NA    1:49  NA
 7 NA    NA    1:49  NA
 8 NA    NA    1:49  NA
 9 NA    NA    1:49  NA
10 NA    NA    1:49  NA
# ... with 39 more rows
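
The repeated 1:49 is probably what a whole data frame with a factor column looks like once it is coerced to text, which would mean the pipeline received the data frame itself rather than the column of paths inside it. A hedged base R sketch of one way around that, using a made-up two-row stand-in for details: pull the column out as a character vector first, then split it.

```r
# Hypothetical stand-in for the data frame built from dir(); the column is a
# factor, as it would be by default before R 4.0.0.
details <- data.frame(path = c("Incoming/Amesh_CV_v1.docx",
                               "Incoming/MIT/Akash_MIT_SoP_v2.docx"),
                      stringsAsFactors = TRUE)

# Pull the single column out as a plain character vector.
files <- as.character(details[[1]])

# Split on "/" and pad on the left so the file name always lands in the last column.
parts  <- strsplit(files, "/")
depth  <- max(lengths(parts))
padded <- do.call(rbind, lapply(parts, function(p) {
  c(rep(NA_character_, depth - length(p)), p)
}))

result <- data.frame(padded, stringsAsFactors = FALSE)
names(result) <- paste0("lv", seq_len(depth))

# Version: "v" plus digits directly before the extension, NA when absent.
pattern <- "^.*(v[0-9]+)\\.[^.]+$"
result$version <- ifelse(grepl(pattern, files),
                         sub(pattern, "\\1", files),
                         NA_character_)
print(result)
```

Because depth is computed from the data, this also copes with folders nested more than three levels deep, where a fixed into = c("lv1", "lv2", "lv3") would not be enough.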

mara · April 9, 2019, 2:25pm

Can you please look at the reprex materials I linked to earlier? Though obviously I don't have your actual system drive, it's very hard to tell what's going on with unformatted code.
