2024-04-02 - 007-PDFParser

INDEX

2024-03-12 12:00 PDF Parser

I wrote a basic bash script to parse HSBC bank statements, at the request of my friend
They're structured in a fairly difficult, inconsistent way but did most of it using awk
	# Use "pdftotext" to extract the text from the file, preserving layout
	# Then use awk to extend every line to 150 characters (as lines have inconsistent lengths)
	rawtext=`pdftotext -layout statement.pdf - \
	 | awk '{while(length($0)<maxLength) $0=$0 " "; print}' maxLength=150

	# Calculate number of pages in the pdf (how many times to loop)
	rawlen=`echo "$rawtext" | grep -o 'BALANCE BROUGHT FORWARD' | wc -l`

	# Extract from the ith page (postition is variable $i) the actual transaction table
	# Essentially selects text between "your ... details" and "BALANCE CARRIED FORWARD"

	# Also if a line ends in "NUM LETTER" then it's the end of a transaction - add a new line
	# This isn't strictly necessary but makes viewing things cleaner
	echo "$rawtext" | awk -v num=$i '/Your .* details/{counter++}
	 counter==num{if(/BALANCE CARRIED FORWARD/) exit;
	 if(/[0-9]([ ]?[A-Za-z])?$/) print $0 "\n"; else print}'

Save that to a variable and you have a list of poorly formatted transactions
Now you can't separate columns normally (due to inconsistencies in spacing)
So wha I did was draw lines of a delimiter down to mark the columns and clean from there
The only issue then is setting where the lines are, so use the headers as reference
	# Can't use the exact word positions as these also overlap - add slight spacing
	# Also as you loop per page, this positioning is also independent to each page
	line1=`echo "$line" | awk 'NR==2 {print index($0, "  Pay")}'`
	line2=`echo "$line" | awk 'NR==2 {print index($0, "e nt")}'`
	line3=`echo "$line" | awk 'NR==2 {print index($0, "   Paid o")}'`
	line4=`echo "$line" | awk 'NR==2 {print index($0, "   Paid i")}'`
	line5=`echo "$line" | awk 'NR==2 {print index($0, "  Balance")}'`

	# Use sed to add the delimiter (a | character) at these character positions
	# Then use "tr -s" to remove any excessive spacing (tr -s ' ' will reduce '   ' to ' ')
	sed "s/./|/$line1; s/./|/$line2; s/./|/$line3; s/./|/$line4; s/./|/$line5; 1,/BALANCE/d" \
	 | tr -s '   '
	
So now you have the columns separated you just need to clean the data
This I just did with a bunch of awk statements - inconsistent as they are based on chatgpt
Columns are [date, symbol, text, out, in, total] (although "total" really makes no sense)
	# First remove lines that aren't relevant
	# Then for lines that don't havve
	sed '/^[^a-zA-Z]*$/d'

	# Use "-F'|'" to define the delimiter separating columns - awk is fairly clever
	# So HSBC statements only include the date in the first item - just fills in the rest
	# Take the 1st column (date) and, if it's empty, copy the date from the above row
	awk -F'|' 'BEGIN{OFS="|"} {if($1==" ") $1=prev; else prev=$1; print}'

	# The 2nd column is a strange symbol ID thing - only appears once per transaction
	# Transactions are for some reason not all on one line so need to combine them together
	# This checks if that column is empty and, if so, combines the "text" section with row above
	awk -F'|' 'BEGIN{OFS="|"} {if($2==" ") $3=prev3 $3; print; prev3=$3}'

	# If no out, in or total values set then line is useless - delete it
	awk -F'|' '{if(!($4==" " && $5==" " && $6==" ")) print}' OFS='|'

	# The "total" column often has letters in it (denoting something specific to HSBC)
	# This just removes those letters
	awk -F'|' '{gsub(/[a-zA-Z]/, "", $6); print}' OFS='|'

	# Delete that 2nd column of symbols entirely, leaving only relevant data behind
	awk -F'|' '{$2=""; sub(/\|{2,}/, "|"); print}' OFS='|'

What you get is a (|) dsv with all the transaction data from a HSBC bank statement

Now what about starling? Much easier
	# Extract text, get lines starting with a date, convert big spaces to commas, contract commas
	# Now you have a csv of transactions - may need to use "|" to work with £1000+ items
	pdftotext -layout statement.pdf - | \
	 grep "^../../2024" | \
	 awk 'BEGIN{FS="  "; OFS=","} {$1=$1; print}' \
	 | tr -s ','

But this "pdftotext" command is not on all systems so how can I dockerise it?
	# In docker-compose.yml set the script to run and the file to run it on
	command: ["/app/${BANK}.sh", "statements/${STATEMENT:-example.pdf}"]

	# In Dockerfile, install poppler and bash (as script is bash dependent)
	FROM alpine:latest
	RUN apk update && apk add poppler-utils bash


2024-04-02 08:38 - Starling etc.

Starling parser doesn't quite work so figuring out a solution for that now
	grep -E 'A|B|C' # Filter only items with "CONTACTLESS"? etc.?
	grep "^../../20..  " # Easier
	# Luckily starling puts everything on the top line of each transaction (unlike HSBC)

Can I extract the columns without drawing lines again?
May just be easier - take the column header line and use as reference
	header=`pdftotext -layout "$1" - | grep "^DATE"`
	echo "$header" | awk '{print index($0, " TYPE")}' # Find location of "type" column
	# Same method as before, draw lines down
	sed "s/./|/$line1; # etc.

Doesn't quite work right away - some cutting of transaction items
Need to shift header lines a bit more left - gave enough for £10k+ transactions

Then remove double spaces and filter with more awk commands - nothing much to filter really
Personally I don't like the "end of day balance" bit so I just remove that too
Oh I forgot to extend the lines fully - items without end balances will have less columns
	awk '{while(length($0)<maxLength) $0=$0 " "; print}' maxLength=150
	# Now you can remove the last column
	awk -F'|' '{$NF=""; print}' OFS='|'

Now we just see if libreoffice can open it
As long as you set delim to only "|" it seems to work perfectly
Just readd the header, formatted
	tr -s ' ' '|' | awk -F'|' '{$NF=""; print}' OFS='|'
	# By the way the "-F '|'" sets the input delimiter and OFS sets output delimiter

Okay now that works as expected move to Barclays - looks a bit less clean
Items on multiple lines (without reason), dates assumed from above - similar to HSBC

Just copied hsbc script - changing key parts and testing throughout
1: find keyword for each transaction page - "Your transactions"
	rawlen=`echo "$rawtext" | grep -o 'Your transactions' | wc -l`
2: set assumed positions of columns (e.g. "Money out"

Not outputting anything so lines must be getting filtered in initial awk
Oh yeah need the start AND end keywords - not consistent on every page
There's a "continued" on most - then at the last there's "anything wrong?"
If I convert that to a "continued" then use as reference that will work
	sed 's/Anything Wrong?/Continued/g'
	# This now somewhat works already - just nede to find relevant lines

Seems like HSBC script almost worked as-is
	# If no date given, use from above 
	awk -F'|' 'BEGIN{OFS="|"} {if($1==" ") $1=prev; else prev=$1; print}'
	# If no in/out, then combine the text in column 2 with above
	awk -F'|' 'BEGIN{OFS="|"} {if($3==" " && $4==" ") $2=prev2 $2; print; prev2=$2}'

Doesn't seem to work as I want - isn't combining the lines together
I could instead exploit that each transaction has a blank line after it
Much easier to just do that in a loop by line

Okay grouped by line by looping through and just wiping a buffer at each blank line
How can I now combine the lines within the buffer to 1 clean line?
	linebuff=""; lineout=""
	while IFS='' read -r line; do
	  if echo "$line" | grep -q -E '\| \| \| \|'; then
	    lineDesc="`echo -e "$linebuff" | awk -F'|' '{print $2}' | paste -sd ''`"
	    lineout="${lineout}\n`echo -e "$linebuff" | sed '/^$/d' | \
	     awk -F'|' -v var="$lineDesc" 'NR==1 {$2=var; print}' OFS='|'`"
	    linebuff=""
	  else
	    linebuff="${linebuff}\n${line}"
	  fi
	done <<< "$filttext"

Looks like it is relatively clean now - it is rather slow but that's not an issue
Opens in libreoffice fine but it is giving a sed error on the line drawing part
Whatever it seems to not affect much so far
	for file in ./Barclays*; do ./barclays "${file}" > "${file%.*}.csv" & done; wait
	# Bit messy but it works overall

Okay so now I've got Starling, HSBC and Barclays all working to the degree I want