Cleaning bank record text

I am trying to clean up the "Transaction Description" field of my banking history so that transactions can be grouped by store, rather than ending up with 15 different vendors like "Amazon.ca #5411", "Amazon.ca HH#523432" and so on.

Using str_sub, I've managed to clean up a large percentage of them, but a less-than-brute-force, line-by-line (or vendor-by-vendor) solution for the balance is eluding me. The data is in character format, with extra whitespace, numbers, and symbols in no specific format.

Sample Data:
IMPARK00030305U
SP * MARILLA WALKER PA
CALG CO-OP GAS
AMAZON.CA U8R3U3
E-TRANSFER CA***csN
NATIONAL OILWEL PAY
WAL-MART #3011
ONCE UPON A CHILD CALG
MCDONALD'S #919 _F
TFR-FR 5226417
OLD COUNTRY MAR
VALLEYVIEW HUSK
ENMAX HH#002
SAFEWAY #8918
TELUS MOBILITY PREAUTH
BEACON MARTINIZ _F
SHELL C01766
CHAPTERS 964 _F
CHASIN TAILS _F
AMAZON.CA UU#001

The end goal of course is to have something that can work with new data without having to be re-written for that specific vendor each time one shows up in the records.

To pare down the inter-account transactions:

library(dplyr)       # mutate(), case_when()
library(stringr)     # str_sub()
library(data.table)  # assuming %like% comes from data.table

df_clean <- df_raw %>%
    mutate(
        # drop the first five characters of transfer records
        Transaction = case_when(
            Transaction %like% "TFR-" ~ str_sub(Transaction, start = 6),
            TRUE ~ Transaction),
        # keep only characters 4-10 of TD ATM records
        Transaction = case_when(
            Transaction %like% "^TD ATM" ~ str_sub(Transaction, start = 4, end = 10),
            TRUE ~ Transaction))

It does the trick for those items, but it doesn't strike me as particularly flexible, nor very useful for thousands of records like the above.

Let me say up front that I don't know all the conventions of your financial transaction data. Working only from the sample data you posted, the solution below may still give you a satisfactory starting point for your problem.

#install.packages("stringr")

vetor <-
    c(
      "IMPARK00030305U",
      "SP * MARILLA WALKER PA",
      "CALG CO-OP GAS",
      "E-TRANSFER CA***csN",
      "NATIONAL OILWEL PAY",
      "ONCE UPON A CHILD CALG",
      "MCDONALD'S #919 _F",
      "TFR-FR 5226417",
      "OLD COUNTRY MAR",
      "VALLEYVIEW HUSK",
      "ENMAX HH#002",
      "SAFEWAY #8918",
      "TELUS MOBILITY PREAUTH",
      "BEACON MARTINIZ _F",
      "SHELL C01766",
      "CHAPTERS 964 _F",
      "CHASIN TAILS _F",
      "AMAZON.CA UU#001"
    )
stringr::str_extract(vetor, pattern = "[:alpha:]*\\s*[^[:digit:]*[:punct:]*]\\s*[:alpha:]+\\s*[:alpha:]*[:alpha:]+")

You could first process the transaction descriptions with a regex like this and, once the extracted information is good enough, incorporate it into your table (a small sketch of that step follows the output below). I think this approach would be the most interesting.

Return:

> stringr::str_extract(vetor, pattern = "[:alpha:]*\\s*[^[:digit:]*[:punct:]*]\\s*[:alpha:]+\\s*[:alpha:]*[:alpha:]+") 
 [1] "IMPARK"                 " MARILLA WALKER"        "CALG CO"                "TRANSFER CA"           
 [5] "NATIONAL OILWEL PAY"    "ONCE UPON A"            "MCDONALD"               "TFR"                   
 [9] "OLD COUNTRY MAR"        "VALLEYVIEW HUSK"        "ENMAX HH"               "SAFEWAY"               
[13] "TELUS MOBILITY PREAUTH" "BEACON MARTINIZ"        "SHELL C"                "CHAPTERS"              
[17] "CHASIN TAILS"           "AMAZON"    

Note: you will probably want to refine the pattern argument passed to str_extract().

Best regards,
prdm0


If you just want to remove the blanks, then do this:

> gsub(" ", "", vetor)
 [1] "IMPARK00030305U"      "SP*MARILLAWALKERPA"   "CALGCO-OPGAS"         "E-TRANSFERCA***csN"  
 [5] "NATIONALOILWELPAY"    "ONCEUPONACHILDCALG"   "MCDONALD'S#919_F"     "TFR-FR5226417"       
 [9] "OLDCOUNTRYMAR"        "VALLEYVIEWHUSK"       "ENMAXHH#002"          "SAFEWAY#8918"        
[13] "TELUSMOBILITYPREAUTH" "BEACONMARTINIZ_F"     "SHELLC01766"          "CHAPTERS964_F"       
[17] "CHASINTAILS_F"        "AMAZON.CAUU#001"     

Thanks @prdm0
I failed to provide the desired output here, which I suppose is key, nor did I include any of my attempts. The desired output is:
IMPARK00030305U
MARILLA WALKER PA
CALG CO-OP GAS
AMAZON.CA
E-TRANSFER
NATIONAL OILWEL PAY
WAL-MART
ONCE UPON A CHILD CALG
MCDONALD'S
TFR-FR 5226417
OLD COUNTRY MAR
VALLEYVIEW HUSK
ENMAX
SAFEWAY
TELUS MOBILITY PREAUTH
BEACON MARTINIZ
SHELL
CHAPTERS
CHASIN TAILS
AMAZON.CA

Your str_extract solution is not unlike trials I made, but runs into the same sort of trouble. E-Transfer {transaction ID} becomes Transfer {truncated transaction ID}. Similarly, Enmax {transaction ID} becomes Enmax {truncated transaction ID}.

My goal is to be able to group transactions by vendor, without separate groups for each transaction ID, or vendor location (whether I bought fuel at the Shell on the corner or the Shell down the road makes no difference for example).

The rate at which new R packages flow into the public sphere vastly outstrips my ability to keep up with what each one does, let alone become truly proficient with any one of them. Having said that, I don't think a regex pattern match is the answer (but I'm happy to be corrected), as there isn't really a pattern: some transactions include locations (Chapters 964), some include transaction IDs, some transaction IDs are numeric, some are alpha-numeric. Some vendors are helpfully named 49th Parallel!
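
For instance, a naive rule like "drop any token that contains a digit" strips the store numbers, but it also eats vendor names like 49th Parallel (the second entry below is made up for illustration):

library(stringr)

x <- c("WAL-MART #3011", "49TH PARALLEL")

# remove every whitespace-delimited token containing a digit, then tidy spaces
str_squish(str_remove_all(x, "\\S*\\d\\S*"))
#> [1] "WAL-MART" "PARALLEL"   # the store number goes, but so does "49TH"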

So this goes back to my original post. I have been unable to come up with an approach that isn't brute-forcing it (something like

mutate(Transaction = expss::vlookup(stringr::str_sub(Transaction, end = 10), Vendor_Name_Table, 1, 2))

which would still require hand-editing hundreds of vendor names to build the lookup table. That doesn't seem smart. Or elegant.
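
For what it's worth, a keyword-table version of that idea can be written with str_detect, so a new vendor only needs a new row rather than new code. A rough sketch, with entirely made-up patterns and a fall-back to the raw text:

library(dplyr)
library(stringr)

# hypothetical keyword table: regex fragment -> canonical vendor name
vendor_table <- data.frame(
    pattern = c("AMAZON", "WAL-?MART", "MCDONALD", "SHELL"),
    vendor  = c("AMAZON.CA", "WAL-MART", "MCDONALD'S", "SHELL")
)

match_vendor <- function(x) {
    hits <- str_detect(x, regex(vendor_table$pattern, ignore_case = TRUE))
    if (any(hits)) vendor_table$vendor[which(hits)[1]] else x   # unmatched rows keep the raw text
}

df_clean <- df_raw %>%
    mutate(Vendor = purrr::map_chr(Transaction, match_vendor))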
