I'm currently working on my data analytics bike-share capstone project and I've had trouble searching for a way to do this. I'm probably wording it incorrectly, but I'll try to explain.
I'm working with a dataset of 13 columns, 4 of which are: start_station_name, start_station_id, start_lat and start_lng.
A lot of information is missing from those columns, but we have enough data to deduce most of it and salvage the rows for the sake of cleaning (learning). So here is the question:
I was thinking of running something like:
IF start_lat = known_number_1 and start_lng = known_number_2 THEN mutate start_station_name = X and start_station_id = Y
Is there any way to do something like this? Where should I look for information about it?
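For reference, the pseudocode above maps fairly directly onto dplyr's `mutate()` with `if_else()`. This is only a minimal sketch: the `rides` data frame is a toy stand-in, the helper column `at_clinton` is a hypothetical name, and the coordinates are the rounded values as printed by `head()`, which is why the comparison uses `near()` rather than `==`.

```r
library(dplyr)

# Toy stand-in for the ride data (assumption: missing values are NA).
rides <- data.frame(
  start_station_name = c(NA, "Wabash Ave & Grand Ave"),
  start_station_id   = c(NA, "TA1307000117"),
  start_lat          = c(41.88224, 41.89147),
  start_lng          = c(-87.64107, -87.62676)
)

rides <- rides %>%
  mutate(
    # TRUE where the coordinates match the known station (tolerant comparison).
    at_clinton = near(start_lat, 41.88224) & near(start_lng, -87.64107),
    # Only fill in values that are actually missing.
    start_station_name = if_else(at_clinton & is.na(start_station_name),
                                 "Clinton St & Madison St", start_station_name),
    start_station_id   = if_else(at_clinton & is.na(start_station_id),
                                 "TA1305000032", start_station_id)
  ) %>%
  select(-at_clinton)
```

This works for a handful of stations, but it needs one rule per station, so it gets unwieldy quickly.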
You'll be better off creating a CSV file of all known lat/long pairs with their station names, reading that file in, and joining it to your original data.frame that has the missing values.
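A rough sketch of that join, assuming dplyr is available. The `stations` table is built inline here to keep the example self-contained (in practice it would come from `read.csv()`), and the column names `known_name` and `known_id` are hypothetical. Note that joining on coordinates only matches when the values are exactly equal, so real data may need rounding first.

```r
library(dplyr)

# Hypothetical lookup of known stations; normally read from a CSV file.
stations <- data.frame(
  start_lat  = c(41.88224, 41.89456),
  start_lng  = c(-87.64107, -87.65345),
  known_name = c("Clinton St & Madison St", "Carpenter St & Huron St"),
  known_id   = c("TA1305000032", "13196")
)

# Toy stand-in for the ride data: rows B and C are missing station info.
rides <- data.frame(
  ride_id            = c("A", "B", "C"),
  start_station_name = c("Clinton St & Madison St", NA, NA),
  start_station_id   = c("TA1305000032", NA, NA),
  start_lat          = c(41.88224, 41.89456, 41.90000),
  start_lng          = c(-87.64107, -87.65345, -87.60000)
)

filled <- rides %>%
  # Attach the known name/id wherever the coordinates match exactly.
  left_join(stations, by = c("start_lat", "start_lng")) %>%
  mutate(
    # Keep existing values; use the lookup only where the original is NA.
    start_station_name = coalesce(start_station_name, known_name),
    start_station_id   = coalesce(start_station_id, known_id)
  ) %>%
  select(-known_name, -known_id)
```

Row B picks up its station from the lookup, while row C stays `NA` because its coordinates are not in the table.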
If you post sample data frames (1. base data and 2. station names) using dput, we can give more specific guidance on how to handle it.
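In case `dput` is new to you, here is a small sketch of how it is typically used. The `sample_rides` data frame is a cut-down stand-in built from two of the rides in this thread; with real data you would usually run something like `dput(head(full_data, 10))` and paste the printed text into your post.

```r
# Small stand-in for the real data frame.
sample_rides <- data.frame(
  ride_id            = c("EC2DE40644C6B0F4", "1C31AD03897EE385"),
  start_station_name = c("Wabash Ave & Grand Ave",
                         "DuSable Lake Shore Dr & Monroe St"),
  start_lat          = c(41.89147, 41.88096),
  start_lng          = c(-87.62676, -87.61674)
)

# Prints R code that recreates the object; this is the text to paste in a post.
dput(sample_rides)

# Anyone can rebuild the data frame by evaluating that text.
dput_text <- capture.output(dput(sample_rides))
restored  <- eval(parse(text = dput_text))
```

Unlike a screenshot or pasted console output, this preserves column types and can be copied straight into an R session.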
Hello and thank you for the reply!
I already know how to export to CSV, but I have no idea how to create the table that would suit my needs.
That is, a table with every row that has missing values.
Could you point me to the kind of function I should look for?
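One possible way to build that table is `filter()` combined with `is.na()`. This is a sketch on toy data, assuming missing values are read in as `NA`; if the CSV stores them as empty strings, reading it with `read.csv(..., na.strings = c("", "NA"))` should convert them. The output path is hypothetical.

```r
library(dplyr)

# Toy stand-in for full_data; real data would come from read.csv().
rides <- data.frame(
  ride_id            = c("A", "B", "C", "D"),
  start_station_name = c("Clinton St & Madison St", NA,
                         "Carpenter St & Huron St", NA),
  start_station_id   = c("TA1305000032", NA, "13196", NA),
  start_lat          = c(41.88224, 41.88224, NA, 41.89456),
  start_lng          = c(-87.64107, -87.64107, NA, -87.65345)
)

# Keep every row where at least one of the four start columns is missing.
missing_rows <- rides %>%
  filter(
    is.na(start_station_name) | is.na(start_station_id) |
      is.na(start_lat) | is.na(start_lng)
  )

# Hypothetical output path; tempdir() keeps the example out of your project folder.
out_path <- file.path(tempdir(), "missing_rows.csv")
write.csv(missing_rows, out_path, row.names = FALSE)
```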
Sure thing! This is the head of the data right after I merged the datasets into a single table.
Also, please let me know if there is a better way to share it. It looks kind of messy, so I've added an image.
head(full_data) #See the first 6 rows of data frame. Also tail(full_data)
           ride_id rideable_type          started_at            ended_at                start_station_name start_station_id              end_station_name end_station_id
1 EC2DE40644C6B0F4  classic_bike 2022-05-23 23:06:58 2022-05-23 23:40:19            Wabash Ave & Grand Ave     TA1307000117        Halsted St & Roscoe St   TA1309000025
2 1C31AD03897EE385  classic_bike 2022-05-11 08:53:28 2022-05-11 09:31:22 DuSable Lake Shore Dr & Monroe St            13300   Field Blvd & South Water St          15534
3 1542FBEC830415CF  classic_bike 2022-05-26 18:36:28 2022-05-26 18:58:18           Clinton St & Madison St     TA1305000032       Wood St & Milwaukee Ave          13221
4 6FF59852924528F8  classic_bike 2022-05-10 07:30:07 2022-05-10 07:38:49           Clinton St & Madison St     TA1305000032        Clark St & Randolph St   TA1305000030
5 483C52CAAE12E3AC  classic_bike 2022-05-10 17:31:56 2022-05-10 17:36:57           Clinton St & Madison St     TA1305000032           Morgan St & Lake St   TA1306000015
6 C0A3AA5A614DCE01  classic_bike 2022-05-04 14:48:55 2022-05-04 14:56:04           Carpenter St & Huron St            13196 Sangamon St & Washington Blvd          13409
  start_lat start_lng  end_lat   end_lng member_casual
1  41.89147 -87.62676 41.94367 -87.64895        member
2  41.88096 -87.61674 41.88635 -87.61752        member
3  41.88224 -87.64107 41.90765 -87.67255        member
4  41.88224 -87.64107 41.88458 -87.63189        member
5  41.88224 -87.64107 41.88578 -87.65102        member
6  41.89456 -87.65345 41.88316 -87.65110        member
Alright. I presume you have hundreds of missing lat/lon values in this data frame that you want to fill in from a file. Where do you keep the station names with known lat/lon data? If you do not have a file and the list of known lat/lon-to-station pairs is small, we can paste it directly into the code. But it is usually better to pull the data in from a text file.
Yes, the file has around 10 thousand entries with missing data. The full file has 5860776 rows.
The known info is within the same CSV file. I'll need to collect it manually, but since the number of stations is limited it shouldn't take too long. It should be simple once I have the missing data in a spreadsheet: I could filter by each lat/lon to fill in the station name, filter by station to add the known lat/lon, use a known station lat to fill in the lon, and so on.
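Since the known info lives in the same file, the lookup may not need to be collected by hand. As a sketch (toy data again, assuming missing values are `NA`), the complete rows can be reduced to one row per distinct station/coordinate combination. The name `station_lookup` is hypothetical.

```r
library(dplyr)

# Toy stand-in for the ride data; the third row is missing its station info.
rides <- data.frame(
  start_station_name = c("Clinton St & Madison St", "Clinton St & Madison St",
                         NA, "Carpenter St & Huron St"),
  start_station_id   = c("TA1305000032", "TA1305000032", NA, "13196"),
  start_lat          = c(41.88224, 41.88224, 41.88224, 41.89456),
  start_lng          = c(-87.64107, -87.64107, -87.64107, -87.65345)
)

station_lookup <- rides %>%
  # Use only rows where all four start columns are present.
  filter(
    !is.na(start_station_name), !is.na(start_station_id),
    !is.na(start_lat), !is.na(start_lng)
  ) %>%
  # Collapse repeated rides into one row per unique combination.
  distinct(start_station_name, start_station_id, start_lat, start_lng)
```

In the real data a station may show up with several slightly different coordinates, so the result can contain more than one row per station and may need rounding or averaging before it is used for filling.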
You had me sold on the idea of exporting the rows with missing info into a CSV file to work on in Excel. Why would we need the known lat/lon for that?