Forming clusters/groups in a dataset based on unique values from some of the columns using R

arvind · June 10, 2022, 10:05am

I have a following dataset, this is a sample dataset which depicts the Products and its features(in each columns). It contains 12 features/columns for the 21 Products/records.

Data.df <- structure(list(Product.group = c("A Box", "A Box", "A Box", "A Box", 
"A Box", "A Box", "A Box", "A Box", "B Box", "B Box", "B Box", 
"B Box", "B Box", "B Box", "B Box", "B Box", "C Box", "C Box", 
"C Box", "C Box", "C Box"), Performance = c("High", "High", "High", 
"Medium", "Medium", "Low", "Low", "Low", "High", "High", "High", 
"High", "Low", "Low", "Low", "Medium", "High", "High", "Low", 
"Low", "Low"), Family = c("A1", "A1", "A1", "A1", "A1", "A2", 
"A2", "A2", "B1", "B1", "B1", "B1", "B1", "B2", "B2", "B1", "C1", 
"C2", "C1", "C1", "C2"), Product.ID = c("A111", "A112", "A113", 
"A114", "A118", "A211", "A222", "A347", "AX12", "AX14", "AX16", 
"AX18", "AY78", "AY89", "AY91", "B122", "AA11", "AA32", "AA43", 
"AC21", "AC43"), Function = c("ELEC", "ELEC", "ELEC", "ELEC", 
"ELEC", "ELEC", "ELEC", "ELEC", "ELEC", "ELEC", "GAS", "GAS", 
"ELEC", "ELEC", "ELEC", "ELEC", "GAS", "GAS", "GAS", "GAS", "GAS"
), Voltage = c("G", "G", "G", "G", "G", "G", "G", "G", "G", "G", 
"G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G"), Gas.stage = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, "A112", "A112", NA, NA, NA, 
NA, "A150", "A150", "A240", "A240", "A240"), Electric = c(NA, 
NA, "22KW", NA, NA, NA, NA, NA, "22KW", NA, NA, NA, NA, NA, "7.5KW", 
NA, NA, NA, NA, NA, NA), Drive = c("Direct", "Direct", "Direct", 
"Direct", "Direct", "Direct", "Direct", "Direct", "Direct", "Direct", 
"Direct", "Direct", "Direct", "Direct", "Direct", "Direct", "Direct", 
"Direct", "Direct", "Direct", "Direct"), Exhaust = c(NA, NA, 
NA, NA, NA, "Single", NA, NA, NA, NA, "Single", NA, NA, NA, NA, 
NA, "Single", NA, NA, "Double", "Double"), Fuse = c("15A", "15A", 
"15A", "15A", "15A", "15A", "15A", "15A", "15A", "15A", "15A", 
"15A", "15A", "15A", "15A", "15A", "15A", "15A", "20A", NA, "15A"
), Accessory = c(NA, NA, NA, NA, NA, "Installed", NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Installed", NA)), row.names = c(NA, 
21L), class = "data.frame")

Is there a way to subset this dataset into multiple groups based on unique values from some of the columns/features. The columns that are used for grouping are

Product group
Performance
Family
Function
Voltage
Need to subset the entire data based on unique values from these columns in the same order mentioned as above and give this subset an unique group id.

I have just started learning about clustering and it would be of great help if anyone could advise on this.

Thank you

pieterjanvc · June 10, 2022, 10:31am

Hello,

I think using the dplyr package from the Tidyverse is the easiest way to do this.

When you use this code (did not copy the large dataframe):

library(tidyverse)
Data.df = Data.df %>% 
  #Group the data by specific columns
  group_by(Product.group, Performance, Family, Function, Voltage) %>% 
  #Create a new column that contains the group ID
  mutate(groupId = cur_group_id()) %>% 
  ungroup() %>% 
  #Put the groupId column first (optional)
  select(groupId, everything())

This is the output

# A tibble: 21 x 13
   groupId Product.group Performance Family Product.ID Function Voltage Gas.stage
     <int> <chr>         <chr>       <chr>  <chr>      <chr>    <chr>   <chr>    
 1       1 A Box         High        A1     A111       ELEC     G       NA       
 2       1 A Box         High        A1     A112       ELEC     G       NA       
 3       1 A Box         High        A1     A113       ELEC     G       NA       
 4       3 A Box         Medium      A1     A114       ELEC     G       NA       
 5       3 A Box         Medium      A1     A118       ELEC     G       NA       
 6       2 A Box         Low         A2     A211       ELEC     G       NA       
 7       2 A Box         Low         A2     A222       ELEC     G       NA       
 8       2 A Box         Low         A2     A347       ELEC     G       NA       
 9       4 B Box         High        B1     AX12       ELEC     G       NA       
10       4 B Box         High        B1     AX14       ELEC     G       NA       
# ... with 11 more rows, and 5 more variables: Electric <chr>, Drive <chr>,
#   Exhaust <chr>, Fuse <chr>, Accessory <chr>

Where you can see that there is now a unique group id for the grouping as specified by you.

To learn more about dplyr syntax, you can read the online documentation

Hope this helps,
PJ

arvind · June 10, 2022, 10:49am

Thank you so much for this. May i know how you do sort the dataset using group id after this

pieterjanvc · June 10, 2022, 10:58am

Hi,

You can again use functions from the dplyr package, in this case: arrange.

Example

Data.df %>% arrange(groupId, Performance)

PJ

arvind · June 10, 2022, 11:15am

Thank you for the help.
Also I have been wondering if its possible to count the frequency of the Unique values including nulls(count of unique values and nulls) in each column/feature in a group excluding the columns that are used for grouping. And is it possible to compare the frequency of the column values from one group vs another?

pieterjanvc · June 10, 2022, 4:14pm

Hello,

First of all: It is better practice to open a new topic if the original question has been answered (i.e., solution found).

Second, if you have a question, please provide a reprex where you at least show us your own attempts so we can work from there. A reprex consists of the minimal code and data needed to recreate the issue/question you're having. You can find instructions how to build and share one here:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

I suggest you create a new issue and rephrase your question with the proper data and code (or at least your attempt) in form of a reprex.

Good luck!
PJ

system · June 17, 2022, 4:15pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.