Parsing a paged TXT file - Reprex included

Hi,

I'm trying to parse a dataset that is basically a TXT file with a table structure in it, but the table is NOT continuous: it is paged, and each page repeats the headers and footnotes as well as the page number.

Is there a quick way to parse this into a single continuous data frame without having to write a bespoke function that identifies where each table starts and finishes?

I've never worked with a dataset like this, which is why I'm looking for an out-of-the-box solution before going nuts and writing my own function to parse it.

The dataset in question is included in the reprex below; it comes from a US government website. If you do not want to access the full dataset, the reprex recreates the first 100 rows of the TXT file.

#Load data, source is available online:
#df <- readr::read_tsv("https://www.cbp.gov/sites/default/files/assets/documents/2023-Jan/FIRMS30%20.TXT")

# If you downloaded the data directly from the URL with the line above, running the line below generates the tibble that follows.
#df |> dplyr::slice_head(n=100) |> datapasta::tribble_paste()

df <- tibble::tribble(
  ~X..........ARCHIVED.FIRMS.CODE...M435.......................................................................................A......0,
  "H778   AUTO ALLIANCE 01                       GIBRALTER RD AND I-75                FLAT ROCK               MI    48134     D     02",
  "H414   CABELA'S INC. 01                      501 CLIFFHAVEN RD                     PRAIRIE DU CHIEN        WI    53821     A     02",
  "LAB0   DO NOT USE                                                                  GREENSBORO              NC    27407     D      0",
  "W771   DO NOT USE                                                                  SEATTLE                 WA    98158     D     04",
  "X      DO NOT USE                                                                                                          D     01",
  "H887   HAWKER BEECHCRAFT CORPORATION 02      2625 SCANLAN AVE                      SALINA                  KS    67401     D     02",
  "P039   NACCO MATERIALS HANDLING GROUP, INC   U S HIGHWAY 278 E                     SULLIGENT               AL    35586     A     02",
  "A360   SENTRY GROUP 03                       140 DESPATCH DR                       EAST ROCHESTER          NY    14445     A     02",
  "U839   VALERO REFINING 02                    801 DOCK RD                           TEXAS CITY              TX    77590     A     02",
  "L067   WESTERN REFINING YORKTOWN, INC. 01    2201 GOODWIN NECK RD                  YORKTOWN                VA    23692     D     02",
  "K711   WIRSBO 02                             21900 DODD BLVD                       LAKEVILLE               MN    55044     D     02",
  "PROCESSING DATE: 12/30/22                            U.S. CUSTOMS AND BORDER PROTECTION                                 PAGE:    1",
  "PROCESSING TIME: 21:00:07                                 PUBLIC ACS FIRMS REPORT",
  "FAC TYPE: 01=CUSTOMS CONTAINER STA  02=FOREIGN TRADE ZONE 03=PIER  04=BONDED WAREHOUSE  05=INSPECTION FACILITY 06=IMPORTER PREMISES",
  "07=DP SITE   08=CUSTOM ADMIN SITE            (FACILITY TYPE 07 AND 08 ARE NOT VALID ON ENTRY AS LOCATION OF GOODS)",
  "REGION: 1  DIST/PORT: 0101",
  "FAC",
  "FIRM   NAME                                  STREET                                CITY                    ST    ZIP     STAT   TYP",
  "----   ----                                  ------                                ----                    --    ---     ----   ---",
  "A011   A L GRIFFIN INC                       8 N KELSEY ST                         SOUTH PORTLAND          ME    04106     A     06",
  "C556   ABF FREIGHT SYSTEMS INC               356 RIVERSIDE INDUSTRIAL PKWY         PORTLAND                ME    04103     A     05",
  "C608   AIRBORNE EXPRESS                      9 JOHNSON RD                          PORTLAND                ME    04102     D     05",
  "C664   AIRBORNE EXPRESS                      53 DARLING AVE                        SOUTH PORTLAND          ME    04106     D     05",
  "A038   AMERICOLD CORPORATION                 165 READ ST                           PORTLAND                ME    04103     A     05",
  "A032   BATH IRON WORKS                       700 WASHINGTON ST                     BATH                    ME    04530     A     03",
  "C877   BOSTON BRANDS OF MAINE                21 SARATOGA ST                        LEWISTON                ME    04240     D     04",
  "A010   BROWN SHIP SERVICES                   38 UNION WHARF                        PORTLAND                ME    04101     A     05",
  "A031   BRUNSWICK EXECUTIVE AIRPORT           15 TERMINAL RD                        BRUNSWICK               ME    04011     A     05",
  "B764   BSP TRANSPORTION INC                  65 EISENHOWER DR                      WESTBROOK               ME    04092     A     05",
  "A020   BUCKEYE OIL TERMINAL                  170 LINCOLN ST                        SOUTH PORTLAND          ME    04106     A     03",
  "A022   CENTRAL TRANSPORT, INC.               1 OLD BRUNSWICK RD                    GARDINER                ME    04345     D     06",
  "A007   CITGO PETROLEUM CORPORATION           102 MECHANIC ST                       SOUTH PORTLAND          ME    04106     A     03",
  "C942   CONSOLIDATED FREIGHTWAYS              9 GINN RD                             SCARBOROUGH             ME    04074     D     06",
  "B767   CUSTOMS RAMP @KPWM                     YELLOW BIRD RD                       PORTLAND                ME    04101     A     08",
  "B762   DELTA AIRFREIGHT                       PORTLAND INT'L JETPORT               PORTLAND                ME    04101     A     05",
  "AAV3   EIMSKIP LOGISTICS                     468 COMMERCIAL ST                     PORTLAND                ME    04101     A     05",
  "A319   EIMSKIP LOGISTICS, INC                468 COMMERCIAL ST                     PORTLAND                ME    04101     A     05",
  "D122   ESTES EXPRESSLINES                    400 RIVER RD                          LEWISTON                ME    04240     A     05",
  "B801   FEDERAL EXPRESS CORPORATION           261 YELLOWBIRD RD                     PORTLAND                ME    04102     A     05",
  "D041   FEDEX FREIGHT                         236 PRESUMPSCOT ST                    PORTLAND                ME    04103     A     05",
  "AAK4   FLEMISH MASTER WEAVERS 01             96 GATE HOUSE RD                      SANFORD                 ME    04073     A     02",
  "A016   FLOATING FLEET LTD.                   468 COMMERCIAL ST                     PORTLAND                ME    04101     A     05",
  "A006   GLOBAL COMPANIES LLC                  1 CLARK RD                            SOUTH PORTLAND          ME    04106     A     03",
  "A004   GULF OIL LIMITED TERMINAL             175 FRONT ST                          SOUTH PORTLAND          ME    04106     A     03",
  "C464   IDEXX LABORATORIES, INC.              1 IDEXX DR                            WESTBROOK               ME    04092     A     06",
  "A015   INDUSTRIAL WELDING & MACHINE          430 COMMERCIAL ST                     PORTLAND                ME    04101     D     03",
  "B760   INTERNATIONAL MARINE TERMINAL         468 COMMERCIAL ST                     PORTLAND                ME    04101     A     03",
  "C557   ITO                                   466 COMMERCIAL ST                     PORTLAND                ME    04101     D     05",
  "A073   JOTUL NORTH AMERICA, INC              55 HUTCHERSON DR                      GORHAM                  ME    04038     D     04",
  "C772   LAND AIR EXPRESS                      9 GINN RD                             SCARBOROUGH             ME    04074     A     05",
  "AAA7   LL BEAN DROP TRAILER YARD             57 KATAHDIN DR                        BRUNSWICK               ME    04011     A     06",
  "AAA4   LL BEAN PRIMARY WAREHOUSE             5 CAMPUS DR                           FREEPORT                ME    04033     A     05",
  "B768   MAC JETS                              100 AVIATION BLVD                     SOUTH PORTLAND          ME    04106     A     05",
  "AAL9   MAINE COAST SHELLFISH, LLC 01         15 HANNAFORD DR                       YORK                    ME    03909     A     02",
  "D232   NEPW LOGISTICS                        140 RODMAN RD                         AUBURN                  ME    04210     A     05",
  "C988   NEW ENGLAND MOTOR FREIGHT             7 MANSON LIBBY RD                     SCARBOROUGH             ME    04074     A     05",
  "C555   NORTHEAST AIR                         1011 WESTBROOK ST                     PORTLAND                ME    04102     A     05",
  "A035   OCEAN GATEWAY TERMINAL                40 COMMERCIAL ST                      PORTLAND                ME    04101     A     05",
  "D301   OLD DOMINION FREIGHT LINE             185 RAND RD                           PORTLAND                ME    04102     A     05",
  "B766   PALCO AIR CARGO                       10 WILLEY RD                          SACO                    ME    04072     A     05",
  "B763   PORTLAND AIR FREIGHT INC              75 POSTAL SERVICE WAY                 SCARBOROUGH             ME    04074     A     05",
  "C582   PRESTON TRUCKING                      4 GINN RD                             SCARBOROUGH             ME    04074     D     05",
  "AAJ1   READY TUBING LLC                      350 PINE POINT RD                     SCARBOROUGH             ME    04074     D     02",
  "PROCESSING DATE: 12/30/22                            U.S. CUSTOMS AND BORDER PROTECTION                                 PAGE:    2",
  "PROCESSING TIME: 21:00:07                                 PUBLIC ACS FIRMS REPORT",
  "REGION: 1  DIST/PORT: 0101",
  "FAC",
  "FIRM   NAME                                  STREET                                CITY                    ST    ZIP     STAT   TYP",
  "----   ----                                  ------                                ----                    --    ---     ----   ---",
  "B759   RED STAR EXPRESS                       TERMINAL WAY                         SOUTH PORTLAND          ME    04106     D     05",
  "A025   ROADWAY EXPRESS                        BRADLEY RD                           WESTBROOK               ME    04092     D     05",
  "D254   ROADWAY EXPRESS (AUGUSTA, ME)         61 TWIN RD                            AUBURN                  ME    04210     A     05",
  "C728   ROADWAY GLOBAL AIR, INC.              236 PRESUMPSCOT ST                    PORTLAND                ME    04103     D     05",
  "B761   SPRAGUE PORTLAND TERMINAL             92 CASSIDY POINT DR                   PORTLAND                ME    04102     A     03",
  "A008   SPRAGUE ROLLING MILLS TERMINAL        59 MAIN ST                            SOUTH PORTLAND          ME    04106     A     03",
  "D233   ST LAWRENCE & ATLANTIC RR CO.         560 LEWISTON JUNCTION RD              AUBURN                  ME    04210     A     05",
  "C554   UNITED AIRLINES                        PORTLAND INT'L JETPORT               PORTLAND                ME    04101     A     05",
  "D317   UPS GROUND FREIGHT INC                80 PLEASANT HILL RD                   SCARBOROUGH             ME    04074     A     05",
  "B765   UPS SUPPLY CHAIN SOLUTIONS, INC.       470 RIVERSIDE STREET                 WESTBROOK               ME    04092     D     05",
  "A001   US CBP OFFICE                         155 GANNETT DR                        SOUTH PORTLAND          ME    04106     A     08",
  "A030   WYMAN STATION                         677 COUSINS ST                        YARMOUTH                ME    04096     A     03",
  "C848   XPO LOGISTICS FREIGHT INC             7 GINN RD                             SCARBOROUGH             ME    04074     A     05",
  "A021   YRC FREIGHT                           75 EISENHOWER DR                      WESTBROOK               ME    04092     A     05",
  "PROCESSING DATE: 12/30/22                            U.S. CUSTOMS AND BORDER PROTECTION                                 PAGE:    3",
  "PROCESSING TIME: 21:00:07                                 PUBLIC ACS FIRMS REPORT",
  "REGION: 1  DIST/PORT: 0102",
  "FAC",
  "FIRM   NAME                                  STREET                                CITY                    ST    ZIP     STAT   TYP",
  "----   ----                                  ------                                ----                    --    ---     ----   ---",
  "A057   AIR NATIONAL GUARD                     BANGOR INTERNATIONAL AIRPORT         BANGOR                  ME    04401     D     05",
  "A046   BAILEY'S TOTAL MOVING CNTR            6 STATE ST                            BREWER                  ME    04412     D     04",
  "C729   BSP TRANSPORT                         1 AMMO INDUSTRIAL PARK                BANGOR                  ME    04401     D     04",
  "D081   COMAIR                                298 GODFREY BLVD                      BANGOR                  ME    04401     D     05",
  "D044   FEDEX FREIGHT INC.                    54-56 GODSOE RD                       BANGOR                  ME    04401     D     05",
  "A043   FOX GINN MOVING & STORAGE CO          195 THATCHER ST                       BANGOR                  ME    04401     D     04",
  "B995   FTZ 58                                 BUILDING 271 FLORIDA AVE             BANGOR                  ME    04401     D     02",
  "A053   JERREY'S CATERING BGR                 61 FLORIDA AVE                        BANGOR                  ME    04401     D     06",
  "C001   PORTLAND AIR FREIGHT                  33 PERRY RD                           BANGOR                  ME    04401     D     05",
  "C860   ROADWAY EXPRESS INC                   12 FREEDOM PKWY                       BANGOR                  ME    04401     D     05",
  "A058   U S POST OFFICE                       202 HARLOW ST                         BANGOR                  ME    04401     D     08"
)

##################################################
##################################################

# If you ran the above, you created a subset of the whole file, which should hint at the problem at hand. Running the line below shows how the data looks when it is loaded into a tibble directly from the URL.

df |> print(n=Inf)

#> # A tibble: 100 × 1
#>     X..........ARCHIVED.FIRMS.CODE...M435.....................................…¹
#>     <chr>                                                                       
#>   1 H778   AUTO ALLIANCE 01                       GIBRALTER RD AND I-75        …
#>   2 H414   CABELA'S INC. 01                      501 CLIFFHAVEN RD             …
#>   3 LAB0   DO NOT USE                                                          …
#>   4 W771   DO NOT USE                                                          …
#>   5 X      DO NOT USE                                                          …
#>   6 H887   HAWKER BEECHCRAFT CORPORATION 02      2625 SCANLAN AVE              …
#>   7 P039   NACCO MATERIALS HANDLING GROUP, INC   U S HIGHWAY 278 E             …
#>   8 A360   SENTRY GROUP 03                       140 DESPATCH DR               …
#>   9 U839   VALERO REFINING 02                    801 DOCK RD                   …
#>  10 L067   WESTERN REFINING YORKTOWN, INC. 01    2201 GOODWIN NECK RD          …
#>  11 K711   WIRSBO 02                             21900 DODD BLVD               …
#>  12 PROCESSING DATE: 12/30/22                            U.S. CUSTOMS AND BORDE…
#>  13 PROCESSING TIME: 21:00:07                                 PUBLIC ACS FIRMS …
#>  14 FAC TYPE: 01=CUSTOMS CONTAINER STA  02=FOREIGN TRADE ZONE 03=PIER  04=BONDE…
#>  15 07=DP SITE   08=CUSTOM ADMIN SITE            (FACILITY TYPE 07 AND 08 ARE N…
#>  16 REGION: 1  DIST/PORT: 0101                                                  
#>  17 FAC                                                                         
#>  18 FIRM   NAME                                  STREET                        …
#>  19 ----   ----                                  ------                        …
#>  20 A011   A L GRIFFIN INC                       8 N KELSEY ST                 …
#>  21 C556   ABF FREIGHT SYSTEMS INC               356 RIVERSIDE INDUSTRIAL PKWY …
#>  22 C608   AIRBORNE EXPRESS                      9 JOHNSON RD                  …
#>  23 C664   AIRBORNE EXPRESS                      53 DARLING AVE                …
#>  24 A038   AMERICOLD CORPORATION                 165 READ ST                   …
#>  25 A032   BATH IRON WORKS                       700 WASHINGTON ST             …
#>  26 C877   BOSTON BRANDS OF MAINE                21 SARATOGA ST                …
#>  27 A010   BROWN SHIP SERVICES                   38 UNION WHARF                …
#>  28 A031   BRUNSWICK EXECUTIVE AIRPORT           15 TERMINAL RD                …
#>  29 B764   BSP TRANSPORTION INC                  65 EISENHOWER DR              …
#>  30 A020   BUCKEYE OIL TERMINAL                  170 LINCOLN ST                …
#>  31 A022   CENTRAL TRANSPORT, INC.               1 OLD BRUNSWICK RD            …
#>  32 A007   CITGO PETROLEUM CORPORATION           102 MECHANIC ST               …
#>  33 C942   CONSOLIDATED FREIGHTWAYS              9 GINN RD                     …
#>  34 B767   CUSTOMS RAMP @KPWM                     YELLOW BIRD RD               …
#>  35 B762   DELTA AIRFREIGHT                       PORTLAND INT'L JETPORT       …
#>  36 AAV3   EIMSKIP LOGISTICS                     468 COMMERCIAL ST             …
#>  37 A319   EIMSKIP LOGISTICS, INC                468 COMMERCIAL ST             …
#>  38 D122   ESTES EXPRESSLINES                    400 RIVER RD                  …
#>  39 B801   FEDERAL EXPRESS CORPORATION           261 YELLOWBIRD RD             …
#>  40 D041   FEDEX FREIGHT                         236 PRESUMPSCOT ST            …
#>  41 AAK4   FLEMISH MASTER WEAVERS 01             96 GATE HOUSE RD              …
#>  42 A016   FLOATING FLEET LTD.                   468 COMMERCIAL ST             …
#>  43 A006   GLOBAL COMPANIES LLC                  1 CLARK RD                    …
#>  44 A004   GULF OIL LIMITED TERMINAL             175 FRONT ST                  …
#>  45 C464   IDEXX LABORATORIES, INC.              1 IDEXX DR                    …
#>  46 A015   INDUSTRIAL WELDING & MACHINE          430 COMMERCIAL ST             …
#>  47 B760   INTERNATIONAL MARINE TERMINAL         468 COMMERCIAL ST             …
#>  48 C557   ITO                                   466 COMMERCIAL ST             …
#>  49 A073   JOTUL NORTH AMERICA, INC              55 HUTCHERSON DR              …
#>  50 C772   LAND AIR EXPRESS                      9 GINN RD                     …
#>  51 AAA7   LL BEAN DROP TRAILER YARD             57 KATAHDIN DR                …
#>  52 AAA4   LL BEAN PRIMARY WAREHOUSE             5 CAMPUS DR                   …
#>  53 B768   MAC JETS                              100 AVIATION BLVD             …
#>  54 AAL9   MAINE COAST SHELLFISH, LLC 01         15 HANNAFORD DR               …
#>  55 D232   NEPW LOGISTICS                        140 RODMAN RD                 …
#>  56 C988   NEW ENGLAND MOTOR FREIGHT             7 MANSON LIBBY RD             …
#>  57 C555   NORTHEAST AIR                         1011 WESTBROOK ST             …
#>  58 A035   OCEAN GATEWAY TERMINAL                40 COMMERCIAL ST              …
#>  59 D301   OLD DOMINION FREIGHT LINE             185 RAND RD                   …
#>  60 B766   PALCO AIR CARGO                       10 WILLEY RD                  …
#>  61 B763   PORTLAND AIR FREIGHT INC              75 POSTAL SERVICE WAY         …
#>  62 C582   PRESTON TRUCKING                      4 GINN RD                     …
#>  63 AAJ1   READY TUBING LLC                      350 PINE POINT RD             …
#>  64 PROCESSING DATE: 12/30/22                            U.S. CUSTOMS AND BORDE…
#>  65 PROCESSING TIME: 21:00:07                                 PUBLIC ACS FIRMS …
#>  66 REGION: 1  DIST/PORT: 0101                                                  
#>  67 FAC                                                                         
#>  68 FIRM   NAME                                  STREET                        …
#>  69 ----   ----                                  ------                        …
#>  70 B759   RED STAR EXPRESS                       TERMINAL WAY                 …
#>  71 A025   ROADWAY EXPRESS                        BRADLEY RD                   …
#>  72 D254   ROADWAY EXPRESS (AUGUSTA, ME)         61 TWIN RD                    …
#>  73 C728   ROADWAY GLOBAL AIR, INC.              236 PRESUMPSCOT ST            …
#>  74 B761   SPRAGUE PORTLAND TERMINAL             92 CASSIDY POINT DR           …
#>  75 A008   SPRAGUE ROLLING MILLS TERMINAL        59 MAIN ST                    …
#>  76 D233   ST LAWRENCE & ATLANTIC RR CO.         560 LEWISTON JUNCTION RD      …
#>  77 C554   UNITED AIRLINES                        PORTLAND INT'L JETPORT       …
#>  78 D317   UPS GROUND FREIGHT INC                80 PLEASANT HILL RD           …
#>  79 B765   UPS SUPPLY CHAIN SOLUTIONS, INC.       470 RIVERSIDE STREET         …
#>  80 A001   US CBP OFFICE                         155 GANNETT DR                …
#>  81 A030   WYMAN STATION                         677 COUSINS ST                …
#>  82 C848   XPO LOGISTICS FREIGHT INC             7 GINN RD                     …
#>  83 A021   YRC FREIGHT                           75 EISENHOWER DR              …
#>  84 PROCESSING DATE: 12/30/22                            U.S. CUSTOMS AND BORDE…
#>  85 PROCESSING TIME: 21:00:07                                 PUBLIC ACS FIRMS …
#>  86 REGION: 1  DIST/PORT: 0102                                                  
#>  87 FAC                                                                         
#>  88 FIRM   NAME                                  STREET                        …
#>  89 ----   ----                                  ------                        …
#>  90 A057   AIR NATIONAL GUARD                     BANGOR INTERNATIONAL AIRPORT …
#>  91 A046   BAILEY'S TOTAL MOVING CNTR            6 STATE ST                    …
#>  92 C729   BSP TRANSPORT                         1 AMMO INDUSTRIAL PARK        …
#>  93 D081   COMAIR                                298 GODFREY BLVD              …
#>  94 D044   FEDEX FREIGHT INC.                    54-56 GODSOE RD               …
#>  95 A043   FOX GINN MOVING & STORAGE CO          195 THATCHER ST               …
#>  96 B995   FTZ 58                                 BUILDING 271 FLORIDA AVE     …
#>  97 A053   JERREY'S CATERING BGR                 61 FLORIDA AVE                …
#>  98 C001   PORTLAND AIR FREIGHT                  33 PERRY RD                   …
#>  99 C860   ROADWAY EXPRESS INC                   12 FREEDOM PKWY               …
#> 100 A058   U S POST OFFICE                       202 HARLOW ST                 …
#> # ℹ abbreviated name:
#> #   ¹​X..........ARCHIVED.FIRMS.CODE...M435.......................................................................................A......0

Created on 2024-08-28 with reprex v2.1.1

Any hints or approaches are welcome.
Thank you in advance for your time.
Best regards,
LF.

After some digging I concluded that there really wasn't an "out of the box" solution for parsing this type of file, so I handled it myself.

If you are lucky enough to find this thread while working with this same dataset, here is my solution to the problem.

library(tidyverse)
library(stringr)
# Reads file line by line:
lines <- read_lines("https://www.cbp.gov/sites/default/files/assets/documents/2023-Jan/FIRMS30%20.TXT")

# Create a regex pattern to detect the lines I want to exclude:
header_footer_patterns <- "^\\s*(PROCESSING DATE|U\\.S\\. CUSTOMS AND BORDER PROTECTION|PAGE:|\\*\\*\\*ARCHIVED\\s+FIRMS\\s+CODE|FAC TYPE:|REGION:|PROCESSING TIME:|\\s{15,}|\\s*FAC|----|\\*\\*\\*\\s+\\*\\*\\*ARCHIVED\\s+FIRMS\\s+CODE\\s+:\\s+M435|\\*\\*\\*\\*\\*\\s+REPORT\\s+UPDATED\\s+ON|\\032|^$)"
# Create a subset of desired lines only:
filtered_lines <- lines[!str_detect(lines, header_footer_patterns)]
# I noticed that the "table headers" (the FIRM NAME STREET ... lines) were still present, because no alternative
# in the pattern above matches them, while the rest was excluded.
# So I wrote a second regex pattern to remove the table headers specifically:
firm_header_pattern <- "^\\s*FIRM\\s+NAME\\s+STREET\\s+CITY\\s+ST\\s+ZIP\\s+STAT\\s+TYP"
# Use str_detect to drop the lines that match:
extracted_lines <- filtered_lines[!str_detect(filtered_lines, firm_header_pattern)]
# After trying several different ways to delimit the columns with regex, I abandoned that approach, exported
# extracted_lines as a TXT file, and opened it in a text editor where I could clearly see where each column started
# and finished. Once I knew the start and end position of each column, I created a tibble:
parsed_data <- tibble(
  Firm_Code = str_sub(extracted_lines, 1, 5),
  Name = str_sub(extracted_lines, 6, 46),
  Street = str_sub(extracted_lines, 47, 84),
  City = str_sub(extracted_lines, 85, 108),
  State = str_sub(extracted_lines, 109, 111),
  Zip = str_sub(extracted_lines, 115, 124),
  Status = str_sub(extracted_lines, 125, 130),
  Type = str_sub(extracted_lines, 131, 133)
)

# Trim the parsed data to remove unnecessary whitespace:
parsed_data <- parsed_data |>
  mutate(across(everything(), str_trim))
# Check the final structure of parsed_data:
parsed_data |> str()
#> tibble [20,547 × 8] (S3: tbl_df/tbl/data.frame)
#>  $ Firm_Code: chr [1:20547] "H778" "H414" "LAB0" "W771" ...
#>  $ Name     : chr [1:20547] "AUTO ALLIANCE 01" "CABELA'S INC. 01" "DO NOT USE" "DO NOT USE" ...
#>  $ Street   : chr [1:20547] "GIBRALTER RD AND I-75" "501 CLIFFHAVEN RD" "" "" ...
#>  $ City     : chr [1:20547] "FLAT ROCK" "PRAIRIE DU CHIEN" "GREENSBORO" "SEATTLE" ...
#>  $ State    : chr [1:20547] "MI" "WI" "NC" "WA" ...
#>  $ Zip      : chr [1:20547] "48134" "53821" "27407" "98158" ...
#>  $ Status   : chr [1:20547] "D" "A" "D" "D" ...
#>  $ Type     : chr [1:20547] "02" "02" "0" "04" ...
# I then exported parsed_data as a tab-delimited text file:
write_tsv(parsed_data, "~/parsed_data_20240828.txt")

Created on 2024-08-28 with reprex v2.1.1
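
For anyone adapting the filtering step, here is a minimal base R sketch of the same idea on a handful of lines taken from the sample above (whitespace shortened for readability). The `furniture` pattern and the variable names are illustrative assumptions, not part of the solution above, and the real file may contain other page-level lines that would need to be added.

```r
# A few lines from the sample: page furniture mixed with data rows.
# Whitespace is shortened here; the real lines are much wider.
lines <- c(
  "PROCESSING DATE: 12/30/22     U.S. CUSTOMS AND BORDER PROTECTION     PAGE:    2",
  "PROCESSING TIME: 21:00:07          PUBLIC ACS FIRMS REPORT",
  "REGION: 1  DIST/PORT: 0101",
  "FAC",
  "FIRM   NAME                  STREET",
  "----   ----                  ------",
  "B759   RED STAR EXPRESS       TERMINAL WAY",
  "",
  "D254   ROADWAY EXPRESS (AUGUSTA, ME)   61 TWIN RD"
)

# One pattern for everything that is not a data row (hypothetical, for this sample):
# report banners, the legend, the region line, the lone "FAC" line,
# the column header, the dashed ruler, and empty lines.
furniture <- "^\\s*(PROCESSING (DATE|TIME):|FAC TYPE:|REGION:|FAC\\s*$|FIRM\\s+NAME\\b|-{3,}|$)"

# Keep only the lines that do not look like page furniture.
data_rows <- lines[!grepl(furniture, lines, perl = TRUE)]
print(data_rows)
```

Unlike `\\s*FAC` in the pattern above, `FAC\\s*$` only matches a line that consists of FAC alone, so a firm code that happens to begin with FAC would not be dropped. Including the FIRM NAME header in the same pattern also removes the need for a second pass.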

Hope this helps someone in the future.
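
The column positions above were found by eye in a text editor. As a hedged alternative, the dashed ruler line under each page header could be used to derive them, since every run of dashes begins where a column begins. The sketch below uses base R on a made-up two-row sample built with `sprintf`; the field widths in `fmt` are illustrative assumptions, not the widths of the real file.

```r
# Build a tiny fixed-width sample. The widths in `fmt` are made up for this demo.
fmt <- "%-7s%-38s%-38s%-24s%-6s%-8s%-7s%s"
ruler <- sprintf(fmt, "----", "----", "------", "----", "--", "---", "----", "---")
rows <- c(
  sprintf(fmt, "A011", "A L GRIFFIN INC", "8 N KELSEY ST", "SOUTH PORTLAND", "ME", "04106", "A", "06"),
  sprintf(fmt, "C608", "AIRBORNE EXPRESS", "9 JOHNSON RD", "PORTLAND", "ME", "04102", "D", "05")
)

# Each run of dashes in the ruler marks the first character of a column.
starts <- as.integer(gregexpr("-+", ruler)[[1]])
# A column ends one character before the next one starts;
# the last column runs to the end of the longest row.
ends <- c(starts[-1] - 1L, max(nchar(rows)))

col_names <- c("Firm_Code", "Name", "Street", "City", "State", "Zip", "Status", "Type")

# Slice every row at the derived positions and trim the padding.
fields <- lapply(seq_along(starts), function(i) {
  trimws(substring(rows, starts[i], ends[i]))
})
parsed <- as.data.frame(setNames(fields, col_names), stringsAsFactors = FALSE)
print(parsed)
```

On the real file, `ruler` would be one of the dashed lines returned by `read_lines()` and `rows` would be `extracted_lines`. That the ruler lines up with every data row is an assumption worth checking before relying on it.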

Good luck,
LF.
