Barplot of different groups with conditions

blackish952 · June 6, 2018, 10:49am

I am working on a data science project at work and my goal is to provide a summary out of the huge dataset.

For instance, I want to know how many customers ordered the House Brand one time, two times, more than two times.
How many ordered the house brand and the nonHouse Brand?
How many ordered just the nonHouse Brand?

How can I achieve this?

Sample dataset

PRODUCT_SUB_LINE_DESCR	  MAJOR_CATEGORY_DESCR	     CUST_REGION_DESCR
SUNDRY	                       SMALL EQUIP	         NORTH EAST REGION
SUNDRY	                       SMALL EQUIP	         SOUTH EAST REGION
SUNDRY	                       SMALL EQUIP	         SOUTH EAST REGION
SUNDRY	                       SMALL EQUIP	         NORTH EAST REGION
SUNDRY	                       PREVENTIVE	         SOUTH CENTRAL REGION
SUNDRY	                       PREVENTIVE	         SOUTH EAST REGION
SUNDRY	                       PREVENTIVE	         SOUTH EAST REGION
SUNDRY	                       SMALL EQUIP	         NORTH CENTRAL REGION
SUNDRY	                       SMALL EQUIP	         MOUNTAIN WEST REGION
SUNDRY	                       SMALL EQUIP	         MOUNTAIN WEST REGION
SUNDRY	                       COMPOSITE	         NORTH CENTRAL REGION
SUNDRY	                       COMPOSITE	         NORTH CENTRAL REGION
SUNDRY	                       COMPOSITE	         OHIO VALLEY REGION
SUNDRY	                       COMPOSITE	         NORTH EAST REGION

Sales	QtySold	     MFGCOST	MarginDollars	new_ProductName
209.97	3	          134.55	72.72	         no
-76.15	-1	          -44.85	-30.4	         no
275.6	2	          162.5	    109.84	         no
138.7	1	          81.25	    55.82	         no
226	    2	          136	    87.28	         no
115	    1	          68	    45.64	         no
210.7	2	          136	    71.98	         no
29	    1	          18.85	    9.77	         no
29	    1	          18.85	    9.77	         no
46.32	2	          37.7	    7.86	         no
159.86	1	          132.4	    24.81	         no
441.3	2	          264.8	    171.2	         no
209.62	1	          132.4	    74.57	         no
209.62	1	          132.4	    74.57	         no

This is not the original dataset. I basically added a new column to my original data set for my Decision Tree Analysis later. But for now, I want to produce some plots here. Private Label is considered to be House Brand.

 new_ProductName = ifelse( PRODUCT_SUB_LINE_DESCR == "PRIVATE 
                          LABEL","yes","no")
 data = data.frame(new_Dataset, new_ProductName)

mishabalyasin · June 6, 2018, 12:50pm

When you say "huge dataset" what exactly do you mean? Millions of rows, billions ... ? Is your data stored in a DB or are you working against in-memory dataset?

If your data is actually in a DB, then I recommend looking into dbplot package. It leverages dplyr to find all the numbers on a remote DB and only bring summary back (which is usually tiny). This is done with ggplot2-like interface.

If you are working against in-memory data, then take a look at geom_bar geom from ggplot2 package. There are examples on a help page that should be enough to get you started.

blackish952 · June 6, 2018, 12:52pm

@mishabalyasin
I think I am working on one day data set and it has 5143 rows. I will work on data set of 15 years or more, so yes, potentially, it can be that big.
I am importing the data set from a .csv file.

mara · June 6, 2018, 12:57pm

It's always good to be specific with "how big is big?"

A billion rows would be > 530 years at a rate of 5,143 rows per day.

blackish952 · June 6, 2018, 1:00pm

I am sorry.
I am new. I have not worked with such a big data set like that.
To me, it is big. I am still learning, and there are a lot to take in.
My company asks me to do many tasks and it is very interesting to dive in Data Science but also challenging since my skill sets are premature.

mara · June 6, 2018, 1:01pm

No problem! Just giving you a heads up as to why it's helpful!

martin.R · June 6, 2018, 1:04pm

You just need enough RAM to handle the data. Otherwise you should look at using a database and retrieve the summary data required for the plot (as advised by @mishabalyasin), which will be minimal in size.

Even ca. 28 million records (15 years of daily data) should not be limiting, but you may wish to read in only the columns you require if there are a large number of redundant columns.

blackish952 · June 6, 2018, 1:06pm

@martin.R
Yes, I did delete so many columns.
However, I don't know where to find more resources on how to summarize data based on the give criteria like I described in my post.
Do you have a reference?

martin.R · June 6, 2018, 1:09pm

You could start with becoming familiar with dplyr for summarising data if you don't know it already:
https://dplyr.tidyverse.org/articles/dplyr.html

However, ggplot2 can sometimes do the summarising for you:

Hopefully the examples should help you.

blackish952 · June 6, 2018, 1:11pm

Thanks everyone!

I am a data-driven person and this working opportunity with my current company gives me a lot of experiences and I am grateful for your help and support .