Modifying a column of appended datasets

Hi. I am having a problem modifying a column of an appended data frame. I used the map_df() function to append the datasets and create a new column entitled 'gauge.location'. The row names for that column correspond to different river gauge locations in Ireland. I read in and appended the datasets with the following code:

myread <- function(mfile) {
t1 <- read.csv(mfile)
t1 <- mutate(t1, gauge.location = mfile)
}

MDF <- list.files(pattern = "*.csv") %>%
map_df(~ myread(.))

However, the problem is that the location names all have the suffix 'MDF.csv'. Thus instead of a row label referring only to the location of a particular river gauge, e.g., 'Antrim', it reads 'Antrim MDF.csv' instead. Because I have eight appended datasets corresponding to eight different locations, I need to remove the MDF.csv suffix from all the locations names (see the output below for an example).

DT.Index FQ. .m3.s. gauge.location
1 01/01/2015 09:00 8.805 Antrim MDF.csv
2 02/01/2015 09:00 7.260 Antrim MDF.csv
3 03/01/2015 09:00 7.626 Antrim MDF.csv
4 04/01/2015 09:00 6.168 Antrim MDF.csv
5 05/01/2015 09:00 5.623 Antrim MDF.csv
6 06/01/2015 09:00 6.153 Ballinderry MDF.csv
7 07/01/2015 09:00 11.009 Ballinderry MDF.csv
8 08/01/2015 09:00 7.007 Ballinderry MDF.csv
9 09/01/2015 09:00 8.627 Ballinderry MDF.csv
10 10/01/2015 09:00 9.042 Ballinderry MDF.csv

This is what I am trying to do:

DT.Index FQ. .m3.s. gauge.location
1 01/01/2015 09:00 8.805 Antrim
2 02/01/2015 09:00 7.260 Antrim
3 03/01/2015 09:00 7.626 Antrim
4 04/01/2015 09:00 6.168 Antrim
5 05/01/2015 09:00 5.623 Antrim
6 06/01/2015 09:00 6.153 Ballinderry
7 07/01/2015 09:00 11.009 Ballinderry
8 08/01/2015 09:00 7.007 Ballinderry
9 09/01/2015 09:00 8.627 Ballinderry
10 10/01/2015 09:00 9.042 Ballinderry

The data frame consists of 11688 observations and 3 variables. Two of the three variables (i.e.,'DT.Index' and 'gauge.location') are character variables, whereas the third variable (FQ..m3.s.) is numeric. This is my first post on here, so apologies if I haven't structured this post in the conventional way.

Kind regards

Gary

Hi @Proteus73, and welcome to the forum! For future reference, you may want to consult this post of how to create a good reprex. If we cannot reproduce the data you are working with or at least recreate something with a similar structure, there isn't really any way to test whether a proposed routine would solve your problem. Instead, we have to go out on a limb and "hope" that it will. That said, based on your explanation, it sounds as though a simply regex routine should do the trick.

There are multiple ways to do what follows, but they all amount to targeting the gauge.location column and simply removing all trailing MDF.csv suffixes. Below, I provide three slightly different approaches, though they should all produce equivalent results:

library(tidyverse)

# create data similar to what was shared
df <- data.frame(
  stringsAsFactors = FALSE,
          DT.Index = c("01/01/2015 09:00",
                       "02/01/2015 09:00","03/01/2015 09:00","04/01/2015 09:00",
                       "05/01/2015 09:00","06/01/2015 09:00","07/01/2015 09:00",
                       "08/01/2015 09:00","09/01/2015 09:00","10/01/2015 09:00"),
         FQ..m3.s. = c(8.805,7.26,7.626,6.168,
                       5.623,6.153,11.009,7.007,8.627,9.042),
    gauge.location = c("Antrim MDF.csv",
                       "Antrim MDF.csv","Antrim MDF.csv","Antrim MDF.csv","Antrim MDF.csv",
                       "Ballinderry MDF.csv","Ballinderry MDF.csv",
                       "Ballinderry MDF.csv","Ballinderry MDF.csv",
                       "Ballinderry MDF.csv")
)

# use mutate and str_remove to remove unwanted suffixed in the gauge.location
# column
df %>% 
  mutate(gauge.location = stringr::str_remove(gauge.location, ' MDF\\.csv
```))
#>            DT.Index FQ..m3.s. gauge.location
#> 1  01/01/2015 09:00     8.805         Antrim
#> 2  02/01/2015 09:00     7.260         Antrim
#> 3  03/01/2015 09:00     7.626         Antrim
#> 4  04/01/2015 09:00     6.168         Antrim
#> 5  05/01/2015 09:00     5.623         Antrim
#> 6  06/01/2015 09:00     6.153    Ballinderry
#> 7  07/01/2015 09:00    11.009    Ballinderry
#> 8  08/01/2015 09:00     7.007    Ballinderry
#> 9  09/01/2015 09:00     8.627    Ballinderry
#> 10 10/01/2015 09:00     9.042    Ballinderry

# use mutate and str_replace to replace unwanted suffixed in the gauge.location
# column
df %>% 
  mutate(gauge.location = stringr::str_replace(gauge.location, ' MDF\\.csv
```, ''))
#>            DT.Index FQ..m3.s. gauge.location
#> 1  01/01/2015 09:00     8.805         Antrim
#> 2  02/01/2015 09:00     7.260         Antrim
#> 3  03/01/2015 09:00     7.626         Antrim
#> 4  04/01/2015 09:00     6.168         Antrim
#> 5  05/01/2015 09:00     5.623         Antrim
#> 6  06/01/2015 09:00     6.153    Ballinderry
#> 7  07/01/2015 09:00    11.009    Ballinderry
#> 8  08/01/2015 09:00     7.007    Ballinderry
#> 9  09/01/2015 09:00     8.627    Ballinderry
#> 10 10/01/2015 09:00     9.042    Ballinderry

# use mutate and gsub to replace unwanted suffixed in the gauge.location column
df %>% 
  mutate(gauge.location = gsub(' MDF\\.csv
```, '', gauge.location))
#>            DT.Index FQ..m3.s. gauge.location
#> 1  01/01/2015 09:00     8.805         Antrim
#> 2  02/01/2015 09:00     7.260         Antrim
#> 3  03/01/2015 09:00     7.626         Antrim
#> 4  04/01/2015 09:00     6.168         Antrim
#> 5  05/01/2015 09:00     5.623         Antrim
#> 6  06/01/2015 09:00     6.153    Ballinderry
#> 7  07/01/2015 09:00    11.009    Ballinderry
#> 8  08/01/2015 09:00     7.007    Ballinderry
#> 9  09/01/2015 09:00     8.627    Ballinderry
#> 10 10/01/2015 09:00     9.042    Ballinderry
1 Like

Thank you so much for your help. Your code gave me an idea of what I needed to do. In the end, I only needed one more line of code. Just in case you're interested, here is the piece of code I needed:

MDF <- list.files(pattern = "*.csv") %>%
map_df(~ myread(.)) %>%
mutate(gauge.location = str_remove_all(gauge.location, "MDF.csv"))

I am a bit of a newbie btw, so apologies if I am not up to speed with how these things are done. Thanks once again for you help hendrikvanb.

Since you know you're stripping the last four characters from the file name, you could also modify your myread() function, like this:

myread <- function(mfile) {
  t1 <- read.csv(mfile)
  t1 <- mutate(t1, gauge.location = mfile) %>% 
    separate(gauge.location, into = 'gauge.location', sep = -4)
}

and remove the suffix from the start.

Oh, I missed that your filenames included ' MDF,csv': in that case, in your solution, @Proteus73, you may also want to strip the space character between the location name and 'MDF.csv' by using

str_remove_all(gauge.location, " MDF.csv")) # note extra space before MDF
1 Like

You're very welcome, @Proteus73. I'm glad you managed to get it sorted! Just a word of caution: based on the data you show, the last line of code may leave you with trailing blanks in the gauge.location column. Perhaps this isn't an issue for your, but it is the kind of thing that can easily trip you up when you try to filter your data and the results returned are not what you may have expected!

For safety, I'd simply replace this line of code

mutate(gauge.location = str_remove_all(gauge.location, "MDF.csv"))

with this

mutate(gauge.location = str_remove(gauge.location, " MDF.csv"))

or, if you really want, you can easily remove leading and trailing blanks from any string using str_squish():

mutate(gauge.location = str_remove(gauge.location, "MDF.csv") %>% 
         str_squish())
2 Likes

Hi dromano. I wasn't even aware of the space character issue, so thanks for pointing that out to me.

1 Like

Hey hendrikvanb. Like I was saying in another reply, I was not aware of the character space issue. Your solutions (and dromano's) work perfectly. Big thanks again.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.