Converting hostnames into friendlier names in R

user124578 · January 10, 2019, 12:25am

I have a list of hostnames that I would like to convert to more friendly names in R. Is this possible, please?

Host name
95b4ae6d890e4c46986d91d7ac4bf08200000W
95b4ae6d890e4c46986d91d7ac4bf08200000W
95b4ae6d890e4c46986d91d7ac4bf08200000V
95b4ae6d890e4c46986d91d7ac4bf08200000V
95b4ae6d890e4c46986d91d7ac4bf08200000Z
95b4ae6d890e4c46986d91d7ac4bf08200000Z
95b4ae6d890e4c46986d91d7ac4bf082000011
95b4ae6d890e4c46986d91d7ac4bf082000011
95b4ae6d890e4c46986d91d7ac4bf082000011
95b4ae6d890e4c46986d91d7ac4bf082000011
95b4ae6d890e4c46986d91d7ac4bf08200000H
95b4ae6d890e4c46986d91d7ac4bf08200000H

jdlong · January 10, 2019, 7:05pm

You could do this all sorts of ways. What did you have in mind?

You could map each of these to a number. Or you could map each to the name of a former President of the US. Or you could make each of them a noble gas.

user124578 · January 10, 2019, 7:07pm

I was hoping for host1, host2, host3, and so on. Just to make it more readable.

taras · January 10, 2019, 7:13pm

How is this stored? A list, a vector, a column of a table?
In a nutshell, my idea would be to generate a vector of friendly names, and then cbind it to the table, or pass it into a list.

E.g.

paste0("host", seq_len(10))

gives you this:

[1] "host1"  "host2"  "host3"  "host4"  "host5"  "host6"  "host7"  "host8"  "host9"  "host10"

Only instead of 10 you'll need to pass something like nrow or length depending on your initial object.
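
As a minimal sketch of that last point, assuming the hostnames live either in a plain vector or in a data frame column (the names `hosts` and `host_df` and their values are made up for illustration), `seq_along()` and `seq_len(nrow())` generate the right number of labels:

```r
# Hypothetical inputs, not taken from the original post
hosts <- c("aaa", "bbb", "ccc")
host_df <- data.frame(host_name = hosts, stringsAsFactors = FALSE)

# For a vector: one label per element
vec_names <- paste0("host", seq_along(hosts))

# For a data frame: one label per row
host_df$name <- paste0("host", seq_len(nrow(host_df)))

vec_names
host_df
```

Both helpers also behave sensibly on empty input, where `1:length(x)` would count down from 1 to 0.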

jdlong · January 10, 2019, 7:15pm

Or maybe something like this:

I start with a data frame named df containing one column, names:

df
#>         names
#> 1  wyezsnmpct
#> 2  loifrapnuq
#> 3  mcotjfeglb
#> 4  zdaelstqor
#> 5  soxtzagqkr
#> 6  rjocznhtqu
#> 7  zspjlkfwat
#> 8  zmqtpdyxcw
#> 9  ldryxkighq
#> 10 eylhsudnom

Then using the dplyr package I calculate a new column based on the row number:

library(dplyr)

df %>%
  mutate(nice_name = paste0("host_", row_number()))
#>         names nice_name
#> 1  wyezsnmpct    host_1
#> 2  loifrapnuq    host_2
#> 3  mcotjfeglb    host_3
#> 4  zdaelstqor    host_4
#> 5  soxtzagqkr    host_5
#> 6  rjocznhtqu    host_6
#> 7  zspjlkfwat    host_7
#> 8  zmqtpdyxcw    host_8
#> 9  ldryxkighq    host_9
#> 10 eylhsudnom   host_10

Created on 2019-01-10 by the reprex package (v0.2.1)

user124578 · January 10, 2019, 7:15pm

It's stored in a data frame as column.

taras · January 10, 2019, 7:18pm

Something like:

library(tidyverse)
df <- tibble(host_name = c(
             "95b4ae6d890e4c46986d91d7ac4bf08200000W",
             "95b4ae6d890e4c46986d91d7ac4bf08200000W",
             "95b4ae6d890e4c46986d91d7ac4bf08200000V",
             "95b4ae6d890e4c46986d91d7ac4bf08200000V",
             "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
             "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
             "95b4ae6d890e4c46986d91d7ac4bf082000011",
             "95b4ae6d890e4c46986d91d7ac4bf082000011",
             "95b4ae6d890e4c46986d91d7ac4bf082000011",
             "95b4ae6d890e4c46986d91d7ac4bf082000011",
             "95b4ae6d890e4c46986d91d7ac4bf08200000H",
             "95b4ae6d890e4c46986d91d7ac4bf08200000H"))

# paste0() joins without a separator, so the result is "host1" rather than "host 1"
df <- cbind(df, name = paste0("host", seq_len(nrow(df))))

Gives you this:

                                host_name   name
1  95b4ae6d890e4c46986d91d7ac4bf08200000W  host1
2  95b4ae6d890e4c46986d91d7ac4bf08200000W  host2
3  95b4ae6d890e4c46986d91d7ac4bf08200000V  host3
4  95b4ae6d890e4c46986d91d7ac4bf08200000V  host4
5  95b4ae6d890e4c46986d91d7ac4bf08200000Z  host5
6  95b4ae6d890e4c46986d91d7ac4bf08200000Z  host6
7  95b4ae6d890e4c46986d91d7ac4bf082000011  host7
8  95b4ae6d890e4c46986d91d7ac4bf082000011  host8
9  95b4ae6d890e4c46986d91d7ac4bf082000011  host9
10 95b4ae6d890e4c46986d91d7ac4bf082000011 host10
11 95b4ae6d890e4c46986d91d7ac4bf08200000H host11
12 95b4ae6d890e4c46986d91d7ac4bf08200000H host12

taras · January 10, 2019, 7:21pm

Yes! I wanted this, but couldn't remember the function for getting the index / row number. Apparently, it is row_number(). Who would have thought.

hoelk · January 10, 2019, 7:21pm

The solutions posted here do not account for the fact that some of your hosts are the same.
When I need to enumerate items, I use this trick:

x <- c(
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H"
)

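paste0("host", as.integer(factor(x)))

which gives you

 [1] "host3" "host3" "host2" "host2" "host4" "host4" "host5" "host5" "host5" "host5" "host1" "host1"

edit: I briefly had xtfrm(x) here instead, but for a character vector it returns ranks in which ties share the minimum (5 5 3 3 7 7 9 9 9 9 1 1), so the numbers come out with gaps. as.integer(factor(x)) is the version that gives the consecutive numbering above.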
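
If you would rather number the hosts in order of first appearance than in sorted order, one option (a sketch, not the only way) is `match()` against `unique()`. The short strings below are assumed stand-ins for the long hostnames:

```r
# Shortened stand-ins for the hostnames (assumed values)
x <- c("W", "W", "V", "V", "Z", "Z", "11", "11", "H")

# unique(x) keeps first-appearance order; match() returns each element's
# position in that vector, so repeated hosts share one number
first_seen <- paste0("host", match(x, unique(x)))
first_seen
```

Here "W" becomes host1 because it appears first, whereas sorting-based numbering would give host1 to "11" or "H" depending on the collation order.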

user124578 · January 10, 2019, 7:22pm

The only issue here is that the same hostname may appear more than once.

taras · January 10, 2019, 7:23pm

~~How? It depends on row numbers, which are sequential and unique (think index)~~

Never mind me, I'm an idiot. I see it now.

jdlong · January 10, 2019, 7:26pm

ohhh.. well @hoelk is spot on with his solution. We could also do this with a more tidyverse solution using the power of group_by:


library(tidyverse)
df <- tibble(host_name = c(
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H"))

df %>%
  group_by(host_name) %>%
  summarize() %>%
  mutate(nice_name = paste0("host_", row_number()))
#> # A tibble: 5 x 2
#>   host_name                              nice_name
#>   <chr>                                  <chr>    
#> 1 95b4ae6d890e4c46986d91d7ac4bf08200000H host_1   
#> 2 95b4ae6d890e4c46986d91d7ac4bf08200000V host_2   
#> 3 95b4ae6d890e4c46986d91d7ac4bf08200000W host_3   
#> 4 95b4ae6d890e4c46986d91d7ac4bf08200000Z host_4   
#> 5 95b4ae6d890e4c46986d91d7ac4bf082000011 host_5

Created on 2019-01-10 by the reprex package (v0.2.1)

taras · January 10, 2019, 7:32pm

Yes. Or, instead of group_by(), do df %>% select(host_name) %>% distinct() to get a dim "lookup" table of distinct names (that's what I thought this table column was!), and engineer friendly names there.
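
A rough sketch of that lookup-table idea, assuming dplyr is installed and using made-up short hostnames (`lookup` is just an illustrative name):

```r
library(dplyr)

# Hypothetical short hostnames standing in for the real ones
df <- tibble(host_name = c("W", "W", "V", "V", "H"))

# distinct() keeps one row per host, in order of first appearance
lookup <- df %>%
  distinct(host_name) %>%
  mutate(nice_name = paste0("host_", row_number()))

lookup
```

Unlike the `group_by()` plus `summarize()` route, this does not sort the hostnames, so the numbering follows the order in which hosts first show up in the data.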

user124578 · January 10, 2019, 7:36pm

Thanks for this! I don't need them to be grouped by host_name. If I remove group_by, some hostnames get more than one name.

taras · January 10, 2019, 7:39pm

Well, you kind of do, whether it is group_by() or distinct(), you'd need to make a list of distinct host names. You'd obviously handle it separately in a different table. Think dimensional table in a relational database...

My 2 cents, FWIW. I may be wrong.

jdlong · January 10, 2019, 7:40pm

I'm just using group_by for the side effect that it makes things unique. Taras recommended distinct (great choice) or even unique which is another option.



library(tidyverse)
df <- tibble(host_name = c(
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H"))

df %>%
  unique() %>%
  mutate(nice_name = paste0("host_", row_number()))
#> # A tibble: 5 x 2
#>   host_name                              nice_name
#>   <chr>                                  <chr>    
#> 1 95b4ae6d890e4c46986d91d7ac4bf08200000W host_1   
#> 2 95b4ae6d890e4c46986d91d7ac4bf08200000V host_2   
#> 3 95b4ae6d890e4c46986d91d7ac4bf08200000Z host_3   
#> 4 95b4ae6d890e4c46986d91d7ac4bf082000011 host_4   
#> 5 95b4ae6d890e4c46986d91d7ac4bf08200000H host_5

Created on 2019-01-10 by the reprex package (v0.2.1)

taras · January 10, 2019, 7:42pm

Fake news, I recommended distinct()! (I guess they give same results though, so pick your poison)
There are many paths to one... solution

jdlong · January 10, 2019, 7:44pm

did not.. YOU'RE fake news!

Ok, so I changed it while you were responding

user124578 · January 10, 2019, 7:46pm

Thanks again! This doesn't give me what I am after. I need to keep the same number of host names. The above example still summarizes the host names. I want to see the host name appear more than once. Thanks

jdlong · January 10, 2019, 7:48pm

oh... well just join it back to your original data:

library(tidyverse)
df <- tibble(host_name = c(
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000W",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000V",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf08200000Z",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf082000011",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H",
  "95b4ae6d890e4c46986d91d7ac4bf08200000H"))

df %>%
  unique() %>%
  mutate(nice_name = paste0("host_", row_number())) %>%
  left_join(df)
#> Joining, by = "host_name"
#> # A tibble: 12 x 2
#>    host_name                              nice_name
#>    <chr>                                  <chr>    
#>  1 95b4ae6d890e4c46986d91d7ac4bf08200000W host_1   
#>  2 95b4ae6d890e4c46986d91d7ac4bf08200000W host_1   
#>  3 95b4ae6d890e4c46986d91d7ac4bf08200000V host_2   
#>  4 95b4ae6d890e4c46986d91d7ac4bf08200000V host_2   
#>  5 95b4ae6d890e4c46986d91d7ac4bf08200000Z host_3   
#>  6 95b4ae6d890e4c46986d91d7ac4bf08200000Z host_3   
#>  7 95b4ae6d890e4c46986d91d7ac4bf082000011 host_4   
#>  8 95b4ae6d890e4c46986d91d7ac4bf082000011 host_4   
#>  9 95b4ae6d890e4c46986d91d7ac4bf082000011 host_4   
#> 10 95b4ae6d890e4c46986d91d7ac4bf082000011 host_4   
#> 11 95b4ae6d890e4c46986d91d7ac4bf08200000H host_5   
#> 12 95b4ae6d890e4c46986d91d7ac4bf08200000H host_5

Created on 2019-01-10 by the reprex package (v0.2.1)
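
One caveat worth sketching: joining from the lookup table returns rows in lookup order, which only matches the original data when identical hostnames sit next to each other. Assuming the rows may be interleaved (the data below is made up, and `lookup`, `named` and `named_direct` are illustrative names), starting the join from the original table keeps its row order:

```r
library(dplyr)

# Hypothetical data with repeated hosts that are NOT adjacent
df <- tibble(host_name = c("W", "V", "W", "H", "V"))

# One row per distinct host, numbered by first appearance
lookup <- df %>%
  distinct(host_name) %>%
  mutate(nice_name = paste0("host_", row_number()))

# left_join() keeps the row order of its first argument, here df
named <- df %>%
  left_join(lookup, by = "host_name")

# The same labels in one step, without a separate lookup table
named_direct <- df %>%
  mutate(nice_name = paste0("host_", match(host_name, unique(host_name))))

named
```

Passing `by = "host_name"` explicitly also silences the "Joining, by" message.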