How to handle tags list in one column of a csv file

rbauduin · October 2, 2018, 9:37am

I have a csv flie of this kind:

"20180901","1.234,45","tag1,tag2,tag3"
"20180905","43,50","tag6,tag2"

i.e., the tags list is one column in the csv. How is this best managed with R, as the number of tags is ont always the same (so splitting in columns might not fit well), and I would like to be able to filet rows based on tags, eg only keep rows having tag1, or only keep rows with tag1 but not tag2.

Is there an easy way to handle this with R/RStdio ?

Thanks in advance!

mfherman · October 2, 2018, 11:24am

A simple approach would be to create new columns for each tag and fill those columns with 1 or 0 if that particular tag is contained in the original tag column. If you many tags, this approach could get somewhat cumbersome, but you should be able to abstract it into a function if you need.

library(tidyverse)

df <- tribble(
  ~code, ~value, ~tags,
  "20180901","1.234,45","tag1,tag2,tag3",
  "20180905","43,50","tag6,tag2"
  )

df %>%
  mutate(
    tag1 = if_else(str_detect(tags, "tag1"), 1, 0),
    tag2 = if_else(str_detect(tags, "tag2"), 1, 0), 
    tag3 = if_else(str_detect(tags, "tag3"), 1, 0)
  )
#> # A tibble: 2 x 6
#>   code     value    tags            tag1  tag2  tag3
#>   <chr>    <chr>    <chr>          <dbl> <dbl> <dbl>
#> 1 20180901 1.234,45 tag1,tag2,tag3     1     1     1
#> 2 20180905 43,50    tag6,tag2          0     1     0

^{Created on 2018-10-02 by the reprex package (v0.2.1)}

rbauduin · October 2, 2018, 11:43am

thanks @mfherman .

Do I understand correctly that this function would take the third column of the csv, split it on ,, then iterate over each tag extracted to set columns to 0 or 1? I'm an R beginner, so if there are interesting pointers to do that, I'm interested.

Thanks

mfherman · October 2, 2018, 12:24pm

That's what I was thinking, but then I started playing around a little I discovered the separate_rows() function from the tidyr package which essentially does all the work for you! From the documentation:

If a variable contains observations with multiple delimited values, this separates the values and places each one in its own row.

library(tidyverse)

df <- tribble(
  ~code, ~value, ~tags,
  "20180901","1.234,45","tag1,tag2,tag3",
  "20180905","43,50","tag6,tag2"
)

tag_separated <- separate_rows(df, tags, sep = ",") 
tag_separated
#> # A tibble: 5 x 3
#>   code     value    tags 
#>   <chr>    <chr>    <chr>
#> 1 20180901 1.234,45 tag1 
#> 2 20180901 1.234,45 tag2 
#> 3 20180901 1.234,45 tag3 
#> 4 20180905 43,50    tag6 
#> 5 20180905 43,50    tag2

After that magic, you could use spread to get it back into a wide form where each tag has its own column (if that is the form you want to work with)

tag_separated %>%
  mutate(tag_dummy = 1) %>%
  spread(tags, tag_dummy, fill = 0)
#> # A tibble: 2 x 6
#>   code     value     tag1  tag2  tag3  tag6
#>   <chr>    <chr>    <dbl> <dbl> <dbl> <dbl>
#> 1 20180901 1.234,45     1     1     1     0
#> 2 20180905 43,50        0     1     0     1

rbauduin · October 2, 2018, 2:17pm

It's nearly working @mfherman!
The problem I encounter occurs at the spread step because (I think) I have pairs of identical rows in the original csv, and I get:

Error: Duplicate identifiers for rows (465, 468), (495, 498), (1346, 1354), (464, 467), (494, 497), (319, 331), (321, 333), (1347, 1355), (463, 466), (493, 496), (1345, 1353), (318, 330)

I'll now take look at the spread doc to understand its working....

mfherman · October 2, 2018, 2:21pm

Ah yes, that can be a difficulty with using spread. There is a good discussion with some workarounds here that should help:

rbauduin · October 2, 2018, 2:24pm

I have found separate, which seems to do the job:

test <-separate(e,tags,sep=",",into = c("t1","t2","t3","t4","t5","t6","t7","t8","t9"))

of course, this is not as modular but possibly workable in my case.

I'm wondering, wouldn't it be easier ot have a vector of tags in that colum?

mfherman · October 2, 2018, 2:44pm

The form of your data frame all depends on what you need to do with. If you gave a little more detail about the final output/product you are looking for, maybe I would have some other ideas. separate would work if you don't need each tag to be in its own corresponding column. With your example data, you would get something like this:

library(tidyverse)

df <- tribble(
  ~code, ~value, ~tags,
  "20180901","1.234,45", "tag1,tag2,tag3",
  "20180905","43,50", "tag6,tag2"
)

separate(df, tags, sep = ",", into = c("t1", "t2", "t3"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
#> # A tibble: 2 x 5
#>   code     value    t1    t2    t3   
#>   <chr>    <chr>    <chr> <chr> <chr>
#> 1 20180901 1.234,45 tag1  tag2  tag3 
#> 2 20180905 43,50    tag6  tag2  <NA>

^{Created on 2018-10-02 by the reprex package (v0.2.1)}`

rbauduin · October 2, 2018, 3:07pm

I kept things simple in the beginning to focus on the problem, but here are some more details @mfherman.
I actually work on a csv file tracking my expenses, each row having six columns:

date
category (I don't use it, it is always Other)
amount
currency (always €)
note (I use that as a category, so notes are from a limited, but possibly growing set)
tags (variable number of tags from a possibly growing set)

Here is a example of what couldbe in such a csv:

"28/7/2018","Other","73,03","€","Ben Shop","Supermarket,Cash"
"5/8/2018","Other","15,08","€","Jon Shop","Market,Food"
"8/8/2018","Other","1,15","€","Baker","Cash,Food"
"25/8/2018","Other","35,69","€","Petrol","Car,CreditCard"

I want to analyse and chart data, for example with:

daily expenses in a bar chart + moving average lines
barchart of expenses total amount per tag, with lines or dots for median value, average value and count of expenses
barchart of expenses total amount per note, ....
possibly same barcharts, but with expenses count rather than total amount
histogram for expenses (bucketed in 10€ width buckets for example)
tag cloud for tags and note based on count and total amount

Suggestions are very welcome, as I'm only starting with R!

rensa · October 3, 2018, 12:57am

tidyr::separate() is the bomb Here's a recent thread that covered its particulars:

https://forum.posit.co/t/why-did-column-values-disappear-after-running-tidyr-separate/

rbauduin · October 4, 2018, 11:48am

rbauduin:

The problem I encounter occurs at the spread step because (I think) I have pairs of identical rows in the original csv, and I get:
Error: Duplicate identifiers for rows (465, 468), (495, 498), (1346, 1354), (464, 467), (494, 497), (319, 331), (321, 333), (1347, 1355), (463, 466), (493, 496), (1345, 1353), (318, 330)

This was solved by first adding an ID column with unique values:

data <- tibble::rowid_to_column(data,"ID")