Filter problem in large dataset

azhangbojun · January 17, 2021, 7:32am

Hello,
I am using Yelp datasets to do some projects.
Aim: The original data is too large (over 2 million rows), so I would like to pick 50,000 rows from it.

The first step was picking data from the review dataset: I randomly chose 50,000 rows and saved them to review5.csv.
Then I want to use these 50,000 rows (review5.csv) as the reference for selecting data from all the other datasets.
All of the datasets contain a user id (user_id).
So I made a vector, user_id_c, containing the 50,000 user ids as character strings.
I did a small trial on a small dataset, "Sells", which contains only 400 rows.

   User.ID Gender Age EstimatedSalary Purchased
1   15624510   Male  19           19000         0
2   15810944   Male  35           20000         0
3   15668575 Female  26           43000         0
4   15603246 Female  27           57000         0
5   15804002   Male  19           76000         0
6   15728773   Male  27           58000         0
7   15598044 Female  27           84000         0
8   15694829 Female  32          150000         1
9   15600575   Male  25           33000         0
10  15727311 Female  35           65000         0
....

I made a small vector containing the values I want to filter on:

a = c(19000,20000,57000,76000)
library(tidyverse)
dataset1 = filter(dataset,EstimatedSalary==a)

It was very successful:

   User.ID Gender Age EstimatedSalary Purchased
1 15624510   Male  19           19000         0
2 15810944   Male  35           20000         0
3 15631159   Male  47           20000         1
4 15639277   Male  23           20000         0
5 15688172 Female  59           76000         1
6 15566689 Female  35           57000         0
7 15654296 Female  50           20000         1

I then applied this method to the huge dataset:

> user_id_c
   [1] "VfHVqfE3kWu1uhR6DYQY9A" "4Ngla54QXt6oHJsKmdVoSQ" "V3t6VJNcO7yXslIJHG7nyA" "yyH5S9mMOADRehpTzqlO1g" "nCuv2BqIYecLYyJDD7OkVQ"
   [6] "v18P5fNZAJiWcHKAxjsW6A" "ffm9PrabyPFGxoMb6XqlBw" "rig21riwrgo6--ix7liMCQ" "4g8YeMIGbbeXpJH6kDiiNg" "8H5AlP3HdKm-tn9GgoKv_w"
......

and

setwd("D:/DATA/yelp_dataset")
user = read.csv('user.csv')
review5 = read.csv('review5.csv')
user_id_c = review5$user_id
user5 = filter(user,user_id==user_id_c)

However, it gave this message:

  longer object length is not a multiple of shorter object length

jrkrideau · January 17, 2021, 11:32pm

This is just a guess, but it looks like user_id and user_id_c are vectors of different lengths.
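
To expand on that guess: `==` compares two vectors position by position and recycles the shorter one, so it does not test whether each id appears anywhere in the other vector. When the lengths do not divide evenly, R prints the "longer object length" message. When they do divide evenly (400 rows against 4 values in the small trial), there is no message, but rows can still be missed silently. Below is a minimal sketch of the difference and of a likely fix with `%in%`. The toy data frame and its values are made up for illustration and are not taken from the Yelp files.

```r
# Toy stand-ins for the real data (hypothetical values).
user <- data.frame(
  user_id      = c("a", "b", "c", "d", "e"),
  review_count = c(10, 3, 7, 1, 4),
  stringsAsFactors = FALSE
)
user_id_c <- c("e", "a")  # ids to keep; shorter than user$user_id

# `==` recycles user_id_c to "e","a","e","a","e" and compares by position.
# 5 is not a multiple of 2, so R also emits the "longer object length"
# warning (suppressed here so the script runs quietly).
by_position <- suppressWarnings(user$user_id == user_id_c)

# `%in%` checks, for every row, whether its id occurs anywhere in user_id_c.
by_membership <- user$user_id %in% user_id_c

# Keep only the rows whose id is in the reference vector.
user5 <- user[by_membership, ]

print(by_position)    # only the last row matches, and only by coincidence
print(by_membership)  # rows "a" and "e" both match
print(user5)
```

With dplyr loaded, the same selection should be `filter(user, user_id %in% user_id_c)`, and the small trial would become `filter(dataset, EstimatedSalary %in% a)`.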
