INDEX
2024-03-12 12:00 PDF Parser I wrote a basic bash script to parse HSBC bank statements, at the request of my friend They're structured in a fairly difficult, inconsistent way but did most of it using awk # Use "pdftotext" to extract the text from the file, preserving layout # Then use awk to extend every line to 150 characters (as lines have inconsistent lengths) rawtext=`pdftotext -layout statement.pdf - \ | awk '{while(length($0)<maxLength) $0=$0 " "; print}' maxLength=150 # Calculate number of pages in the pdf (how many times to loop) rawlen=`echo "$rawtext" | grep -o 'BALANCE BROUGHT FORWARD' | wc -l` # Extract from the ith page (postition is variable $i) the actual transaction table # Essentially selects text between "your ... details" and "BALANCE CARRIED FORWARD" # Also if a line ends in "NUM LETTER" then it's the end of a transaction - add a new line # This isn't strictly necessary but makes viewing things cleaner echo "$rawtext" | awk -v num=$i '/Your .* details/{counter++} counter==num{if(/BALANCE CARRIED FORWARD/) exit; if(/[0-9]([ ]?[A-Za-z])?$/) print $0 "\n"; else print}' Save that to a variable and you have a list of poorly formatted transactions Now you can't separate columns normally (due to inconsistencies in spacing) So wha I did was draw lines of a delimiter down to mark the columns and clean from there The only issue then is setting where the lines are, so use the headers as reference # Can't use the exact word positions as these also overlap - add slight spacing # Also as you loop per page, this positioning is also independent to each page line1=`echo "$line" | awk 'NR==2 {print index($0, " Pay")}'` line2=`echo "$line" | awk 'NR==2 {print index($0, "e nt")}'` line3=`echo "$line" | awk 'NR==2 {print index($0, " Paid o")}'` line4=`echo "$line" | awk 'NR==2 {print index($0, " Paid i")}'` line5=`echo "$line" | awk 'NR==2 {print index($0, " Balance")}'` # Use sed to add the delimiter (a | character) at these character positions # Then use "tr -s" to remove any excessive spacing (tr -s ' ' will reduce ' ' to ' ') sed "s/./|/$line1; s/./|/$line2; s/./|/$line3; s/./|/$line4; s/./|/$line5; 1,/BALANCE/d" \ | tr -s ' ' So now you have the columns separated you just need to clean the data This I just did with a bunch of awk statements - inconsistent as they are based on chatgpt Columns are [date, symbol, text, out, in, total] (although "total" really makes no sense) # First remove lines that aren't relevant # Then for lines that don't havve sed '/^[^a-zA-Z]*$/d' # Use "-F'|'" to define the delimiter separating columns - awk is fairly clever # So HSBC statements only include the date in the first item - just fills in the rest # Take the 1st column (date) and, if it's empty, copy the date from the above row awk -F'|' 'BEGIN{OFS="|"} {if($1==" ") $1=prev; else prev=$1; print}' # The 2nd column is a strange symbol ID thing - only appears once per transaction # Transactions are for some reason not all on one line so need to combine them together # This checks if that column is empty and, if so, combines the "text" section with row above awk -F'|' 'BEGIN{OFS="|"} {if($2==" ") $3=prev3 $3; print; prev3=$3}' # If no out, in or total values set then line is useless - delete it awk -F'|' '{if(!($4==" " && $5==" " && $6==" ")) print}' OFS='|' # The "total" column often has letters in it (denoting something specific to HSBC) # This just removes those letters awk -F'|' '{gsub(/[a-zA-Z]/, "", $6); print}' OFS='|' # Delete that 2nd column of symbols entirely, leaving only relevant data behind awk -F'|' '{$2=""; sub(/\|{2,}/, "|"); print}' OFS='|' What you get is a (|) dsv with all the transaction data from a HSBC bank statement Now what about starling? Much easier # Extract text, get lines starting with a date, convert big spaces to commas, contract commas # Now you have a csv of transactions - may need to use "|" to work with £1000+ items pdftotext -layout statement.pdf - | \ grep "^../../2024" | \ awk 'BEGIN{FS=" "; OFS=","} {$1=$1; print}' \ | tr -s ',' But this "pdftotext" command is not on all systems so how can I dockerise it? # In docker-compose.yml set the script to run and the file to run it on command: ["/app/${BANK}.sh", "statements/${STATEMENT:-example.pdf}"] # In Dockerfile, install poppler and bash (as script is bash dependent) FROM alpine:latest RUN apk update && apk add poppler-utils bash
2024-04-02 08:38 - Starling etc. Starling parser doesn't quite work so figuring out a solution for that now grep -E 'A|B|C' # Filter only items with "CONTACTLESS"? etc.? grep "^../../20.. " # Easier # Luckily starling puts everything on the top line of each transaction (unlike HSBC) Can I extract the columns without drawing lines again? May just be easier - take the column header line and use as reference header=`pdftotext -layout "$1" - | grep "^DATE"` echo "$header" | awk '{print index($0, " TYPE")}' # Find location of "type" column # Same method as before, draw lines down sed "s/./|/$line1; # etc. Doesn't quite work right away - some cutting of transaction items Need to shift header lines a bit more left - gave enough for £10k+ transactions Then remove double spaces and filter with more awk commands - nothing much to filter really Personally I don't like the "end of day balance" bit so I just remove that too Oh I forgot to extend the lines fully - items without end balances will have less columns awk '{while(length($0)<maxLength) $0=$0 " "; print}' maxLength=150 # Now you can remove the last column awk -F'|' '{$NF=""; print}' OFS='|' Now we just see if libreoffice can open it As long as you set delim to only "|" it seems to work perfectly Just readd the header, formatted tr -s ' ' '|' | awk -F'|' '{$NF=""; print}' OFS='|' # By the way the "-F '|'" sets the input delimiter and OFS sets output delimiter Okay now that works as expected move to Barclays - looks a bit less clean Items on multiple lines (without reason), dates assumed from above - similar to HSBC Just copied hsbc script - changing key parts and testing throughout 1: find keyword for each transaction page - "Your transactions" rawlen=`echo "$rawtext" | grep -o 'Your transactions' | wc -l` 2: set assumed positions of columns (e.g. "Money out" Not outputting anything so lines must be getting filtered in initial awk Oh yeah need the start AND end keywords - not consistent on every page There's a "continued" on most - then at the last there's "anything wrong?" If I convert that to a "continued" then use as reference that will work sed 's/Anything Wrong?/Continued/g' # This now somewhat works already - just nede to find relevant lines Seems like HSBC script almost worked as-is # If no date given, use from above awk -F'|' 'BEGIN{OFS="|"} {if($1==" ") $1=prev; else prev=$1; print}' # If no in/out, then combine the text in column 2 with above awk -F'|' 'BEGIN{OFS="|"} {if($3==" " && $4==" ") $2=prev2 $2; print; prev2=$2}' Doesn't seem to work as I want - isn't combining the lines together I could instead exploit that each transaction has a blank line after it Much easier to just do that in a loop by line Okay grouped by line by looping through and just wiping a buffer at each blank line How can I now combine the lines within the buffer to 1 clean line? linebuff=""; lineout="" while IFS='' read -r line; do if echo "$line" | grep -q -E '\| \| \| \|'; then lineDesc="`echo -e "$linebuff" | awk -F'|' '{print $2}' | paste -sd ''`" lineout="${lineout}\n`echo -e "$linebuff" | sed '/^$/d' | \ awk -F'|' -v var="$lineDesc" 'NR==1 {$2=var; print}' OFS='|'`" linebuff="" else linebuff="${linebuff}\n${line}" fi done <<< "$filttext" Looks like it is relatively clean now - it is rather slow but that's not an issue Opens in libreoffice fine but it is giving a sed error on the line drawing part Whatever it seems to not affect much so far for file in ./Barclays*; do ./barclays "${file}" > "${file%.*}.csv" & done; wait # Bit messy but it works overall Okay so now I've got Starling, HSBC and Barclays all working to the degree I want