Hello,
I am using the Yelp datasets for a project.
Aim: The original data is too large (over 2 million rows), so I would like to pick 50,000 rows from it.
- The first step was picking data from the review dataset: I randomly chose 50,000 rows and saved them to review5.csv.
Then I want to use these 50,000 rows (review5.csv) as the reference for choosing data from all the other datasets.
All of the datasets contain a user ID column (user_id).
- Then I made a character vector, user_id_c, containing 50,000 character values.
- I ran a small trial on a small dataset, "Sells", which contains only 400 rows:
    User.ID Gender Age EstimatedSalary Purchased
1  15624510   Male  19           19000         0
2  15810944   Male  35           20000         0
3  15668575 Female  26           43000         0
4  15603246 Female  27           57000         0
5  15804002   Male  19           76000         0
6  15728773   Male  27           58000         0
7  15598044 Female  27           84000         0
8  15694829 Female  32          150000         1
9  15600575   Male  25           33000         0
10 15727311 Female  35           65000         0
....
I made a small vector containing the values I want to filter on:
a = c(19000,20000,57000,76000)
library(tidyverse)
dataset1 = filter(dataset,EstimatedSalary==a)
It was very successful:
   User.ID Gender Age EstimatedSalary Purchased
1 15624510   Male  19           19000         0
2 15810944   Male  35           20000         0
3 15631159   Male  47           20000         1
4 15639277   Male  23           20000         0
5 15688172 Female  59           76000         1
6 15566689 Female  35           57000         0
7 15654296 Female  50           20000         1
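One thing worth checking about this trial: as far as I understand, `==` compares two vectors element by element and recycles the shorter one, so each row is tested against only one value of `a` rather than all four. With 400 rows and 4 values the lengths divide evenly, so R gives no warning, but matching rows may be silently dropped. The sketch below uses a made-up 8-row data frame (`toy` is a hypothetical name, not the real "Sells" data) and base R subsetting to compare `==` with `%in%`:

```r
# Toy data, invented for illustration only (not the real "Sells" dataset)
toy <- data.frame(
  id = 1:8,
  EstimatedSalary = c(19000, 20000, 43000, 57000, 76000, 58000, 19000, 20000)
)
a <- c(19000, 20000, 57000, 76000)

# `==` recycles `a` along the column: row i is compared only with
# a[((i - 1) %% length(a)) + 1], not with every value in `a`
by_equal <- toy[toy$EstimatedSalary == a, ]

# `%in%` tests each row against all values in `a`
by_in <- toy[toy$EstimatedSalary %in% a, ]

print(by_equal$id)  # rows kept by ==
print(by_in$id)     # rows kept by %in%
```

In this toy case `==` keeps only rows 1 and 2, while `%in%` keeps rows 1, 2, 4, 5, 7 and 8, so the two approaches do not select the same rows.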
- Then I applied the same method to the large dataset:
> user_id_c
[1] "VfHVqfE3kWu1uhR6DYQY9A" "4Ngla54QXt6oHJsKmdVoSQ" "V3t6VJNcO7yXslIJHG7nyA" "yyH5S9mMOADRehpTzqlO1g" "nCuv2BqIYecLYyJDD7OkVQ"
[6] "v18P5fNZAJiWcHKAxjsW6A" "ffm9PrabyPFGxoMb6XqlBw" "rig21riwrgo6--ix7liMCQ" "4g8YeMIGbbeXpJH6kDiiNg" "8H5AlP3HdKm-tn9GgoKv_w"
......
and
setwd("D:/DATA/yelp_dataset")
user = read.csv('user.csv')
review = read.csv('review5.csv')
user_id_c = review$user_id
user5 = filter(user,user_id==user_id_c)
However, R reports:
longer object length is not a multiple of shorter object length
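For context, base R raises this message when `==` recycles a shorter vector against a longer one whose length is not a multiple of it, which suggests the comparison is element by element rather than a membership test. Below is a minimal sketch of a membership filter with `%in%`, assuming the goal is to keep every user whose user_id appears anywhere in user_id_c. The two small data frames are hypothetical stand-ins for user.csv and review5.csv, and the `name` and `stars` columns are invented for the example:

```r
# Hypothetical stand-ins for user.csv and review5.csv
user <- data.frame(
  user_id = c("u1", "u2", "u3", "u4", "u5"),
  name = c("Ann", "Bob", "Cy", "Dee", "Eve"),
  stringsAsFactors = FALSE
)
review <- data.frame(
  user_id = c("u2", "u5", "u2"),
  stars = c(4, 5, 1),
  stringsAsFactors = FALSE
)

# A user can have several reviews, so drop duplicate ids first
user_id_c <- unique(review$user_id)

# Keep the users whose id appears anywhere in user_id_c
user5 <- user[user$user_id %in% user_id_c, ]
# dplyr equivalent: user5 <- filter(user, user_id %in% user_id_c)

print(user5)
```

Because `%in%` returns one TRUE/FALSE per row of `user`, the lengths of the two vectors no longer need to match.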