Some newbie questions.

Quick background: I analysed data extensively with SAS and SPSS in the past, and followed an HE course a couple of years ago that used R. It took some getting used to, but was not that hard. I haven't used it since.

To get back into using R, I thought I'd set up a simple dataset and analysis, with some data management to aggregate over time.

I want to use the daily visitor numbers per OS from the analytics.usa.gov site:

library(readr)  # read_csv() comes from readr
os <- read_csv("https://analytics.usa.gov/data/live/os.csv")

The aggregation should happen by occasionally importing the most recent version of the file, taking the union with the existing table, and adding only the rows whose date/OS combination does not already occur in the current table.

I'd also need to check whether any new OSs have started being included, and to write monitoring info to a file.

For the analysis I'd use a categorisation table mapping the OS names to mobile, desktop, gaming console, or other.
In the resulting tables or dashboards I'd want the monthly and yearly sums, percentages, etc.

My questions are:

  1. I read that using data.table is probably best for its SQL-like abilities. But should the data stay there? I'll make separate backups, of course, but I wonder if there is advice on the best way to store the data: base data frames? data.tables?

  2. In SAS I could just use a format (only showing the month/year) to allow summarising per month/year, but in SPSS I would need to recode. How are reports grouped by month/year handled in R reporting? Do I need to recode (probably by using CAST in SQL)?

  3. I'd probably want a couple of clickable summaries on a web page. This is new to me, but it was made to sound really easy. :) Is Shiny the best place to start looking?

I know it is a simple example, but I hope it will prove a simple way to get these basics done. I am also interested in how well Linux and Chromebooks are doing.
I considered using Google Sheets, but you cannot really program and manage data with sheets, IMHO.

Thanks for your help.
Y.

Merging two data frames by adding new to old

The dplyr package includes a function full_join() that does what I think you intend for merging data.

library(tibble) # Used for reproducible example
library(dplyr)

old <- tribble(
  ~date,        ~os,         ~visits,
  "2019-10-08", "Android",   82,
  "2019-10-08", "Chrome OS", 52,
  "2019-10-09", "Android",   70,
  "2019-10-09", "Chrome OS", 30
)

new <- tribble(
  ~date,        ~os,         ~visits,
  "2019-10-09", "Android",   70,
  "2019-10-09", "Chrome OS", 30,
  "2019-10-10", "Android",   19,
  "2019-10-10", "Chrome OS", 99
)

joined <- full_join(old, new)
joined

# A tibble: 6 x 3
#  date       os        visits
#  <chr>      <chr>      <dbl>
# 1 2019-10-08 Android       82
# 2 2019-10-08 Chrome OS     52
# 3 2019-10-09 Android       70
# 4 2019-10-09 Chrome OS     30
# 5 2019-10-10 Android       19
# 6 2019-10-10 Chrome OS     99

Storing data

Data tables and data frames are in-memory constructs in R. You can write the data to disk in whatever format you like and read it back into either a table or a frame. Using save() to store an .RData file and lock yourself into a particular class seems like a bad plan.

Serializing the data to RDS files with saveRDS() and tracking them however you want might be useful for space reasons. Until I/O gets to be a pain, serializing to CSV with write.csv() should work.
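
For example (the file names here are just placeholders):

# RDS round-trip: compact, preserves column types and classes
saveRDS(joined, "os_visits.rds")
joined_rds <- readRDS("os_visits.rds")

# CSV round-trip: human-readable, easy to inspect and back up
write.csv(joined, "os_visits.csv", row.names = FALSE)
joined_csv <- read.csv("os_visits.csv")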

Summarizing data

Take a look at the dplyr package and its tutorials for manipulating data in R. There are convenient functions like group_by(), and a pipe operator, %>%, to pass the resulting data along from one function to the next.
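
This also covers your month/year question: no stored recode is needed, you just derive a month column on the fly. A minimal sketch, reusing the joined table from the example above:

library(dplyr)

joined %>%
  mutate(month = format(as.Date(date), "%Y-%m")) %>%  # "2019-10" style labels
  group_by(month, os) %>%
  summarise(visits = sum(visits), .groups = "drop")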

Keep SQL for DB queries

You shouldn't need to write any SQL for handling tabular data. Heck, you shouldn't need to do that for handling multiple data sets. Keep SQL in its place and use it for querying databases.

Making a web page with R

That's an entire question unto itself, and one to address once you have R summarizing data the way you like. I suggest starting with an R Markdown document that emits a plain HTML page or a PDF. Only dive into Shiny if you need something interactive.
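
For orientation, a minimal R Markdown skeleton that knits to a plain HTML page might look like this (file contents shown verbatim; the chunk name and the kable() call are just illustrative):

---
title: "Visitors by OS"
output: html_document
---

```{r os-summary}
library(readr)
os <- read_csv("https://analytics.usa.gov/data/live/os.csv")
knitr::kable(head(os))  # render the first few rows as an HTML table
```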


Thanks for your time and expertise.

  1. Thanks for your example, but doesn't that imply a simple concatenation of the sets?
    I want a standard routine to add the most recent version (callable at any day/time) that removes any duplicates of date/os. Functionally: a union. Practically: an append that adds only those records whose date/os combination does not already occur in the os table. So the first read covers the 90 days looking back, and each later run should happen within 90 days and simply add the records to complete the picture up to then. I see a merge as adding fields from two sets by a key; I want a union by date+os, or an append with duplicates not added. Is that easily possible, or does your union already do that?

  2. Storing the data
    Since the group variable (os) may change, there could be slight variation in the number of records per day, but it should be in the order of 2,250 per 90 days, so about 10,000 a year.
    The second table would map os to os_cat (os_cat being mobile, desktop, console, or other), currently with 22 records and the occasional addition.

  3. Summarising the data can simply be done with a group-by and a left outer join to collect the os_cat (see the sketch after this list). No need to store that, though I'll probably write it to a file for download.

  4. Most queries can easily be solved with SQL, and most data management can easily be described with SQL. Limit the logic outside of it to what is necessary.
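
For point 3, a minimal sketch of that left outer join (the os_cats lookup below is hypothetical, showing only two of the 22 rows; joined is the table from the example above):

library(dplyr)
library(tibble)

os_cats <- tribble(
  ~os,         ~os_cat,
  "Android",   "mobile",
  "Chrome OS", "desktop"
)

joined %>%
  left_join(os_cats, by = "os") %>%  # attach the category to each record
  group_by(os_cat) %>%
  summarise(visits = sum(visits))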

For the join in point 1,

main <- unique(rbind(main, imported))

seems to do the trick.
Thanks.

A full join is a concatenation only if all of the columns are the same and none of the rows are identical across the two inputs. In the example above, both inputs have 4 rows and the result has 6 rows: the two rows that both data frames have in common show up only once in the result.
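
If you ever need the union keyed on just date+os (for example, when a revised file reports a different visit count for a day that is already in the table), full_join() would keep both rows. A sketch of one way to keep only the first, i.e. existing, record per key, using the old and new frames from above:

library(dplyr)

main <- bind_rows(old, new) %>%
  distinct(date, os, .keep_all = TRUE)  # first occurrence per date/os wins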


Thanks for your help/expertise.
The dplyr tip is probably the one I'll use the most. One of the other sources I hope to use is an online table whose rows and columns are alphabetically sorted but can change over time. dplyr may help me reshape that information back into a set with several grouping variables and fact variables that can be aggregated over time.
Will take a bit of figuring out.
Thnx.
