I am trying to clean up the "Transaction Description" of my banking history so they can be grouped by store rather than having 15 different vendors like "Amazon.ca #5411", "Amazon.ca HH#523432" and so on.
Using str_sub, I've managed to clean up a large percentage of the them, but a less-than-brute-force, line-by-line (or vendor-by-vendor) solution for the balance is eluding me. The data is in character format, with extra whitespaces, numbers and symbols with no specific format.
Sample Data:
IMPARK00030305U
SP * MARILLA WALKER PA
CALG CO-OP GAS
AMAZON.CA U8R3U3
E-TRANSFER CA***csN
NATIONAL OILWEL PAY
WAL-MART #3011
ONCE UPON A CHILD CALG
MCDONALD'S #919 _F
TFR-FR 5226417
OLD COUNTRY MAR
VALLEYVIEW HUSK
ENMAX HH#002
SAFEWAY #8918
TELUS MOBILITY PREAUTH
BEACON MARTINIZ _F
SHELL C01766
CHAPTERS 964 _F
CHASIN TAILS _F
AMAZON.CA UU#001
The end goal of course is to have something that can work with new data without having to be re-written for that specific vendor each time one shows up in the records.
I initially emphasize that I do not know all the standards of your financial transaction information. Only by taking Sample Data that you have informed, maybe the solution below will provide satisfactory information for your problem.
You could first handle transaction information using Regex. After the information you have processed is sufficient, incorporate it into your table. I think this approach would be the most interesting.
Thanks @prdm0
I failed to provide the desired output here, which I suppose is key nor did I include any of my attempts.
IMPARK00030305U
MARILLA WALKER PA
CALG CO-OP GAS
AMAZON.CA
E-TRANSFER
NATIONAL OILWEL PAY
WAL-MART
ONCE UPON A CHILD CALG
MCDONALD'S
TFR-FR 5226417
OLD COUNTRY MAR
VALLEYVIEW HUSK
ENMAX
SAFEWAY
TELUS MOBILITY PREAUTH
BEACON MARTINIZ
SHELL
CHAPTERS
CHASIN TAILS
AMAZON.CA
Your str_extract solution is not unlike trials I made, but runs into the same sort of trouble. E-Transfer {transaction ID} becomes Transfer {truncated transaction ID}. Similarly, Enmax {transaction ID} becomes Enmax {truncated transaction ID}.
My goal is to be able to group transactions by vendor, without separate groups for each transaction ID, or vendor location (whether I bought fuel at the Shell on the corner or the Shell down the road makes no difference for example).
The rate at which new R packages flow into the public sphere vastly out-strips my ability to keep up with what each one does and the speed at which I become truly proficient with any one. Having said that, I don't think a regex pattern match is the answer (but happy to be corrected) as there isn't really a pattern. Some transactions include locations (Chapters 964), some include transaction IDs, some transaction IDs are numeric, some are alpha-numeric. Some vendors are helpfully named 49th Parallel!
So this goes back to my original post. I have been unable to decipher a means that isn't brute-forcing it (something like