Cleaning bank record text

I am trying to clean up the "Transaction Description" field of my banking history so that transactions can be grouped by store, rather than ending up with 15 different vendors like "Amazon.ca #5411", "Amazon.ca HH#523432" and so on.

Using str_sub, I've managed to clean up a large percentage of them, but a less-than-brute-force, line-by-line (or vendor-by-vendor) solution for the balance is eluding me. The data is in character format, with extra whitespace, numbers, and symbols in no specific format.

Sample Data:
IMPARK00030305U
SP * MARILLA WALKER PA
CALG CO-OP GAS
AMAZON.CA U8R3U3
E-TRANSFER CA***csN
NATIONAL OILWEL PAY
WAL-MART #3011
ONCE UPON A CHILD CALG
MCDONALD'S #919 _F
TFR-FR 5226417
OLD COUNTRY MAR
VALLEYVIEW HUSK
ENMAX HH#002
SAFEWAY #8918
TELUS MOBILITY PREAUTH
BEACON MARTINIZ _F
SHELL C01766
CHAPTERS 964 _F
CHASIN TAILS _F
AMAZON.CA UU#001

The end goal of course is to have something that can work with new data without having to be re-written for that specific vendor each time one shows up in the records.

To pare down the inter-account transactions:

library(dplyr)       # mutate(), case_when()
library(stringr)     # str_sub()
library(data.table)  # assuming %like% comes from data.table

df_clean <- df_raw %>%
    mutate(
        # drop the first five characters of transfer records
        Transaction = case_when(
            Transaction %like% "TFR-" ~ str_sub(Transaction, start = 6),
            TRUE ~ Transaction),
        # keep only characters 4-10 of TD ATM records
        Transaction = case_when(
            Transaction %like% "^TD ATM" ~ str_sub(Transaction, start = 4, end = 10),
            TRUE ~ Transaction))

It does the trick for those items, but it doesn't strike me as particularly flexible, nor very useful for thousands of records like the above.

Let me say up front that I don't know all the conventions of your financial transaction data. Working only from the sample data you posted, the solution below may still give you a satisfactory starting point for your problem.

#install.packages("stringr")

vetor <-
    c(
      "IMPARK00030305U",
      "SP * MARILLA WALKER PA",
      "CALG CO-OP GAS",
      "E-TRANSFER CA***csN",
      "NATIONAL OILWEL PAY",
      "ONCE UPON A CHILD CALG",
      "MCDONALD'S #919 _F",
      "TFR-FR 5226417",
      "OLD COUNTRY MAR",
      "VALLEYVIEW HUSK",
      "ENMAX HH#002",
      "SAFEWAY #8918",
      "TELUS MOBILITY PREAUTH",
      "BEACON MARTINIZ _F",
      "SHELL C01766",
      "CHAPTERS 964 _F",
      "CHASIN TAILS _F",
      "AMAZON.CA UU#001"
    )
stringr::str_extract(vetor, pattern = "[:alpha:]*\\s*[^[:digit:]*[:punct:]*]\\s*[:alpha:]+\\s*[:alpha:]*[:alpha:]+")

You could first process the transaction descriptions with a regex like this and, once the extracted information is good enough, incorporate it into your table (a small sketch of that step follows the output below). I think this approach would be the most interesting.

Return:

> stringr::str_extract(vetor, pattern = "[:alpha:]*\\s*[^[:digit:]*[:punct:]*]\\s*[:alpha:]+\\s*[:alpha:]*[:alpha:]+") 
 [1] "IMPARK"                 " MARILLA WALKER"        "CALG CO"                "TRANSFER CA"           
 [5] "NATIONAL OILWEL PAY"    "ONCE UPON A"            "MCDONALD"               "TFR"                   
 [9] "OLD COUNTRY MAR"        "VALLEYVIEW HUSK"        "ENMAX HH"               "SAFEWAY"               
[13] "TELUS MOBILITY PREAUTH" "BEACON MARTINIZ"        "SHELL C"                "CHAPTERS"              
[17] "CHASIN TAILS"           "AMAZON"    

Note: you will probably want to refine the pattern argument passed to str_extract().

Best regards,
prdm0


If you just want to remove the blanks, then do this:

> gsub(" ", "", vetor)
 [1] "IMPARK00030305U"      "SP*MARILLAWALKERPA"   "CALGCO-OPGAS"         "E-TRANSFERCA***csN"  
 [5] "NATIONALOILWELPAY"    "ONCEUPONACHILDCALG"   "MCDONALD'S#919_F"     "TFR-FR5226417"       
 [9] "OLDCOUNTRYMAR"        "VALLEYVIEWHUSK"       "ENMAXHH#002"          "SAFEWAY#8918"        
[13] "TELUSMOBILITYPREAUTH" "BEACONMARTINIZ_F"     "SHELLC01766"          "CHAPTERS964_F"       
[17] "CHASINTAILS_F"        "AMAZON.CAUU#001"     

Thanks @prdm0
I failed to provide the desired output here, which I suppose is key, nor did I include any of my attempts. The desired output is:
IMPARK00030305U
MARILLA WALKER PA
CALG CO-OP GAS
AMAZON.CA
E-TRANSFER
NATIONAL OILWEL PAY
WAL-MART
ONCE UPON A CHILD CALG
MCDONALD'S
TFR-FR 5226417
OLD COUNTRY MAR
VALLEYVIEW HUSK
ENMAX
SAFEWAY
TELUS MOBILITY PREAUTH
BEACON MARTINIZ
SHELL
CHAPTERS
CHASIN TAILS
AMAZON.CA

Your str_extract solution is not unlike trials I made, but runs into the same sort of trouble. E-Transfer {transaction ID} becomes Transfer {truncated transaction ID}. Similarly, Enmax {transaction ID} becomes Enmax {truncated transaction ID}.

My goal is to be able to group transactions by vendor, without separate groups for each transaction ID, or vendor location (whether I bought fuel at the Shell on the corner or the Shell down the road makes no difference for example).

The rate at which new R packages flow into the public sphere vastly outstrips my ability to keep up with what each one does, let alone become truly proficient with any one of them. Having said that, I don't think a regex pattern match is the answer (but I'm happy to be corrected), as there isn't really a pattern: some transactions include locations (Chapters 964), some include transaction IDs, some transaction IDs are numeric, some are alpha-numeric. Some vendors are helpfully named 49th Parallel!
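
For instance, a naive rule like "drop any token that contains a digit" strips the store numbers, but it also eats vendor names like 49th Parallel (the second entry below is made up for illustration):

library(stringr)

x <- c("WAL-MART #3011", "49TH PARALLEL")

# remove every whitespace-delimited token containing a digit, then tidy spaces
str_squish(str_remove_all(x, "\\S*\\d\\S*"))
#> [1] "WAL-MART" "PARALLEL"   # the store number goes, but so does "49TH"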

So this goes back to my original post. I have been unable to come up with an approach that isn't brute-forcing it (something like

mutate(Transaction = expss::vlookup(stringr::str_sub(Transaction, end = 10), Vendor_Name_Table, 1, 2))

which would still require hand-editing hundreds of vendor names to build the lookup table. That doesn't seem smart. Or elegant.
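
For what it's worth, a keyword-table version of that idea can be written with str_detect, so a new vendor only needs a new row rather than new code. A rough sketch, with entirely made-up patterns and a fall-back to the raw text:

library(dplyr)
library(stringr)

# hypothetical keyword table: regex fragment -> canonical vendor name
vendor_table <- data.frame(
    pattern = c("AMAZON", "WAL-?MART", "MCDONALD", "SHELL"),
    vendor  = c("AMAZON.CA", "WAL-MART", "MCDONALD'S", "SHELL")
)

match_vendor <- function(x) {
    hits <- str_detect(x, regex(vendor_table$pattern, ignore_case = TRUE))
    if (any(hits)) vendor_table$vendor[which(hits)[1]] else x   # unmatched rows keep the raw text
}

df_clean <- df_raw %>%
    mutate(Vendor = purrr::map_chr(Transaction, match_vendor))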
