Dates and times are always a pain; look into:
?as.Date
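As a minimal sketch of what that help page covers: the date strings below are made up for illustration, since the actual date format in the data set is not shown in this thread. Strings already in ISO `YYYY-MM-DD` form parse directly, and anything else needs an explicit `format`.

```r
# Hypothetical inputs; adjust the format string to match the real data.
iso <- as.Date("2019-04-15")                        # ISO format needs no format argument
dmy <- as.Date("15/04/2019", format = "%d/%m/%Y")   # day/month/year needs one

# Both strings parse to the same Date value.
identical(iso, dmy)

# Once parsed, components can be pulled out with format().
occ_year  <- format(iso, "%Y")
occ_month <- format(iso, "%m")
```

The format codes (`%d`, `%m`, `%Y`, and so on) are documented under `?strptime`.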
Please use the original dput structure I posted for this following question:
The first column, where you see GO-2019770786, is the event_unique_id. Although it is called unique, I see duplicates. I understand that one event can have multiple offences, i.e. MCI categories, in the dataset, and those are not duplicates. However, for some records I found duplicate event ids with the same MCI. In this case, how would I drop the duplicates?
I am not sure how to proceed here.
The first two records are duplicates whereas the last two are not.
event_unique_id | premisetype | ucr_code | ucr_ext | offence | MCI |
---|---|---|---|---|---|
GO-20141262553 | Other | 1430 | 100 | Assault | Assault |
GO-20141262553 | Other | 1430 | 100 | Assault | Assault |
GO-20141296470 | Commercial | 2120 | 200 | B&E | Break and Enter |
GO-20141296470 | Commercial | 1480 | 100 | Assault - Resist/ Prevent Seiz | Assault |
If you want a data frame with no duplicated rows, you can use the unique() function. If the data frame is named DF:
DF_uniq <- unique(DF)
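To see how this behaves on the four sample rows from the question, here is a small sketch. The data frame is rebuilt by hand rather than from the original dput output, so the column types are an assumption. It also shows `duplicated()` as an alternative for the case where rows should count as duplicates when only event_unique_id and MCI match; `DF_by_key` is just an illustrative name.

```r
# Rebuild the four sample rows from the question (types are assumed).
DF <- data.frame(
  event_unique_id = c("GO-20141262553", "GO-20141262553",
                      "GO-20141296470", "GO-20141296470"),
  premisetype = c("Other", "Other", "Commercial", "Commercial"),
  ucr_code = c(1430, 1430, 2120, 1480),
  ucr_ext = c(100, 100, 200, 100),
  offence = c("Assault", "Assault", "B&E", "Assault - Resist/ Prevent Seiz"),
  MCI = c("Assault", "Assault", "Break and Enter", "Assault"),
  stringsAsFactors = FALSE
)

# Drop rows that are identical in every column.
DF_uniq <- unique(DF)

# Alternative: treat rows as duplicates when only the event id and MCI match,
# keeping the first occurrence of each pair.
DF_by_key <- DF[!duplicated(DF[, c("event_unique_id", "MCI")]), ]

nrow(DF_uniq)
nrow(DF_by_key)
```

On this sample both approaches drop only the repeated Assault row and keep the two GO-20141296470 rows, because those differ in MCI. On the full data the two can differ: `unique()` keeps any pair of rows that differ in even one column, while the `duplicated()` version ignores every column except the two key columns.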
Thank you so much. This works.
If I use tree-based algorithms, say a decision tree or random forest, to train, is integer/label encoding enough for the categorical variables in the dataset (premise type, occ month, occ day of week, neighbourhood)? Is one-hot encoding required only if I use other algorithms? Also, can lat/lon be left as is for these tree-based algorithms? Sorry for so many questions. Any help will be appreciated.
Thanks,
Please start a new thread for this question. It is very different than your initial question and a new thread will be much more likely to attract someone with the right knowledge.
Okay. Sure. Thanks for your help. I just posted it.