From Paper to Pixel: Evaluating the Best Techniques for Digitising Handwritten Texts | by Sohrab Sani | Sep, 2023

A Comparative Dive into OCR, Transformer Models, and Prompt Engineering-based Ensemble Methods

By: Sohrab Sani and Diego Capozzi

Organisations have long grappled with the tedious and costly task of digitising historical handwritten documents. Previously, Optical Character Recognition (OCR) techniques, such as AWS Textract (TT) [1] and Azure Form Recognizer (FR) [2], have led the charge for this. Although these options may be widely available, they have many downsides: they are costly, require lengthy data processing/cleaning and can yield suboptimal accuracy. Recent Deep Learning advances in image segmentation and Natural Language Processing that utilise transformer-based architectures have enabled the development of OCR-free techniques, such as the Document Understanding Transformer (Donut) [3] model.

In this study, we compare OCR- and Transformer-based techniques for this digitisation process using our custom dataset, which was created from a series of handwritten forms. Benchmarking on this relatively simple task is intended to lead towards more complex applications on longer, handwritten documents. To increase accuracy, we also explored an ensemble approach, using prompt engineering with the gpt-3.5-turbo Large Language Model (LLM) to combine the outputs of TT and the fine-tuned Donut model.

The code for this work can be seen in this GitHub repository. The dataset is available on our Hugging Face repository here.

Table of Contents:

· Dataset creation
· Methods
Azure Form Recognizer (FR)
AWS Textract (TT)
Ensemble Method: TT, Donut, GPT
· Measurement of Model Performance
· Results
· Additional Considerations
Donut model training
Prompt engineering variability
· Conclusion
· Next Steps
· References
· Acknowledgements

Dataset creation

This study created a custom dataset from 2100 handwritten-form images from the NIST Special Database 19 dataset [4]. Figure 1 provides a sample image of one of these forms. The final collection comprises 2099 forms. To curate this dataset, we cropped the top section of each NIST form, targeting the DATE, CITY, STATE, and ZIP CODE (hereafter referred to as “ZIP”) keys highlighted within the red box [Figure 1]. This approach began the benchmarking process with a relatively simple text-extraction task, enabling us to select and manually label the dataset quickly. At the time of writing, we are unaware of any publicly available datasets with labelled images of handwritten forms that could be used for JSON key-field text extraction.

Figure 1. Example form from the NIST Special Database 19 dataset. The red box identifies the cropping process, which selects only the DATE, CITY, STATE, and ZIP fields on this form. (Image by the authors)

We manually extracted the values for each key from the documents and double-checked these for accuracy. In total, 68 forms were discarded for containing at least one illegible character. Characters from the forms were recorded exactly as they appeared, regardless of spelling errors or formatting inconsistencies.

To fine-tune the Donut model on missing data, we added 67 empty forms that would enable training on these empty fields. Missing values within the forms are represented as a “None” string in the JSON output.
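As a concrete illustration, a single annotation can be stored as an image/JSON pair. The file name and record layout below are hypothetical; only the field keys and the “None” convention come from the dataset description:

```python
import json

# One labelled form; missing handwriting is recorded as the string "None".
label = {"DATE": "8/28/89", "CITY": "Murray", "STATE": "KY", "ZIP": "42071"}
empty_label = {"DATE": "None", "CITY": "None", "STATE": "None", "ZIP": "None"}

# A training pair keeps the target JSON as a string alongside the image path.
record = {"file_name": "form_0001.png", "ground_truth": json.dumps(label)}
```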

Figure 2a displays a sample form from our dataset, while Figure 2b shows the corresponding JSON that is then linked to that form.

Figure 2. (a) Example image from the dataset; (b) Extracted data in a JSON format. (Image by the authors)

Table 1 provides a breakdown of the variability within the dataset for each key. From most to least variable, the order is ZIP, CITY, DATE, and STATE. All dates were within the year 1989, which may have reduced overall DATE variability. Additionally, although there are only 50 US states, STATE variability was increased by the different acronyms or case-sensitive spellings that were used for individual state entries.

Table 1. Summary statistics of the dataset. (Image by the authors)

Table 2 summarises the character lengths of various attributes of our dataset.

Table 2. Summary of character length and distribution. (Image by the authors)

The above data shows that CITY entries had the longest character length while STATE entries had the shortest. The median values for each entry closely follow their respective means, indicating a relatively uniform distribution of character lengths around the average for each category.

After annotating the data, we split it into three subsets: training, validation, and testing, with respective sample sizes of 1400, 199, and 500. Here is a link to the notebook that we used for this.
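A sketch of such a split is shown below; the helper name and seed are our own assumptions, and only the 1400/199/500 sizes come from the text:

```python
import random

def split_dataset(records, seed=42):
    """Shuffle and split the records into train/validation/test of sizes 1400/199/500."""
    records = list(records)
    random.Random(seed).shuffle(records)
    return records[:1400], records[1400:1599], records[1599:2099]

train, val, test = split_dataset(range(2099))
```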

Methods

We will now expand on each method that we tested and link these to the relevant Python code, which contains more details. The methods are first described individually, i.e. FR, TT and Donut, and then the TT+GPT+Donut ensemble approach is covered.

Azure Form Recognizer (FR)

Figure 3 depicts the workflow for extracting handwritten text from our form images using Azure FR:

  1. Store the images: This could be on a local drive or another solution, such as an S3 bucket or Azure Blob Storage.
  2. Azure SDK: Python script for loading each image from storage and transferring it to the FR API via the Azure SDK.
  3. Post-processing: Using an off-the-shelf method means that the final output often needs refining. Here are the 21 extracted keys that require further processing:
    ['DATE', 'CITY', 'STATE', "'DATE", 'ZIP', 'NAME', 'E ZIP', '·DATE', '.DATE', 'NAMR', 'DATE®', 'NAMA', '_ZIP', '.ZIP', 'print the following shopsataca i', '-DATE', 'DATE.', 'No.', 'NAMN', 'STATEnZIP']
    Some keys have extra dots or underscores, which require removal. Due to the close positioning of the text within the forms, there are numerous instances where extracted values are mistakenly associated with incorrect keys. These issues are then addressed to a reasonable extent.
  4. Save the result: Save the result to a storage space in a pickle format.
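The key clean-up in step 3 can be sketched as follows. The helper name and matching rules are illustrative assumptions, not the exact post-processing used in the study:

```python
import re
from typing import Optional

CANONICAL_KEYS = ["DATE", "CITY", "STATE", "ZIP", "NAME"]

def clean_key(raw_key: str) -> Optional[str]:
    """Map a noisy extracted key (e.g. '·DATE', '_ZIP', 'NAMR') to a canonical one."""
    # Strip punctuation, symbols and whitespace, keeping only letters.
    stripped = re.sub(r"[^A-Za-z]", "", raw_key).upper()
    if stripped in CANONICAL_KEYS:
        return stripped
    # Tolerate single-character OCR slips such as 'NAMR' or 'NAMA' for 'NAME'.
    for key in CANONICAL_KEYS:
        if len(stripped) == len(key) and sum(a != b for a, b in zip(stripped, key)) == 1:
            return key
    return None  # unrecognised key; handled separately downstream
```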
Figure 3. Visualisation of the Azure FR workflow. (Image by the authors)

AWS Textract (TT)

Figure 4 depicts the workflow for extracting handwritten text from our form images by using AWS TT:

  1. Store the images: The images are stored in an S3 bucket.
  2. SageMaker Notebook: A Notebook instance facilitates interaction with the TT API, executes the post-processing cleaning script, and saves the results.
  3. TT API: This is the off-the-shelf OCR-based text-extraction API provided by AWS.
  4. Post-processing: Using an off-the-shelf method means that the final output often needs refining. TT produced a dataset with 68 columns, more than the 21 columns from the FR approach. This is largely due to additional text detected in the images being treated as fields. These issues are then addressed via rule-based post-processing.
  5. Save the result: The refined data is then saved in an S3 bucket using a pickle format.
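The rule-based reduction of step 4 — collapsing the wide, 68-column TT output to the four target fields — might look like the sketch below. The function and matching rule are illustrative assumptions:

```python
from typing import Dict, Optional

CANONICAL = ("DATE", "CITY", "STATE", "ZIP")

def reduce_columns(raw_row: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Collapse a wide row of noisy TT columns into the four canonical fields:
    each raw key is matched to the canonical field it contains, and the first
    non-empty value wins."""
    result = {key: None for key in CANONICAL}
    for raw_key, value in raw_row.items():
        if not value:
            continue
        for key in CANONICAL:
            if key in raw_key.upper() and result[key] is None:
                result[key] = value.strip()
                break
    return result
```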
Figure 4. Visualisation of the TT workflow. (Image by the authors)


Donut

In contrast to the off-the-shelf OCR-based approaches, which are unable to adapt to specific data input through custom fields and/or model retraining, this section delves into refining the OCR-free approach using the Donut model, which is based on a transformer model architecture.

First, we fine-tuned the Donut model with our data before applying the model to our test images to extract the handwritten text in a JSON format. In order to retrain the model efficiently and curb potential overfitting, we employed the EarlyStopping module from PyTorch Lightning. With a batch size of 2, the training terminated after 14 epochs. Here are more details of the fine-tuning process for the Donut model:

  • We allocated 1,400 images for training, 199 for validation, and the remaining 500 for testing.
  • We used naver-clova-ix/donut-base as our foundation model, which is available on Hugging Face.
  • This model was then fine-tuned using a Quadro P6000 GPU with 24GB of memory.
  • The total training time was approximately 3.5 hours.
  • For more intricate configuration details, refer to the train_nist.yaml in the repository.
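The EarlyStopping behaviour used above can be sketched in plain Python: training stops once the monitored validation metric has gone a fixed number of epochs without improving. The patience value here is illustrative, not the study's actual setting:

```python
def early_stopping_epoch(val_metrics, patience=3):
    """Return the 1-based epoch at which training would stop: when the monitored
    validation metric (lower is better) has not improved for `patience` epochs."""
    best, since_best = float("inf"), 0
    for epoch, metric in enumerate(val_metrics, start=1):
        if metric < best:
            best, since_best = metric, 0  # new best; reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_metrics)  # ran to completion without triggering
```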

This model can also be downloaded from our Hugging Face space repository.

Ensemble Method: TT, Donut, GPT

A variety of ensembling methods were explored, and the combination of TT, Donut and GPT performed the best, as explained below.

Once the JSON outputs had been obtained by applying TT and Donut individually, they were used as inputs to a prompt that was then passed on to GPT. The aim was to use GPT to take the information within these JSON inputs, combine it with contextual GPT information and create a new/cleaner JSON output with enhanced content reliability and accuracy [Table 3]. Figure 5 provides a visual overview of this ensembling approach.
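A minimal sketch of how such a prompt might be assembled from the two candidate JSONs — the wording below is illustrative, not the study's actual prompt:

```python
import json

def build_ensemble_prompt(tt_json: dict, donut_json: dict) -> str:
    """Assemble an illustrative reconciliation prompt from the two extractions."""
    return (
        "Two systems extracted fields from a handwritten form.\n"
        f"OCR output: {json.dumps(tt_json)}\n"
        f"Donut output: {json.dumps(donut_json)}\n"
        "Merge them into a single JSON with keys DATE, CITY, STATE, ZIP. "
        "Use 'None' for missing values and correct obvious OCR errors. "
        "Return only the JSON."
    )
```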

Figure 5. Visual description of the ensembling method that combines TT, Donut and GPT. (Image by the authors)

The creation of the appropriate GPT prompt for this task was iterative and required the introduction of ad-hoc rules. The tailoring of the GPT prompt to this task — and possibly the dataset — is an aspect of this study that requires further exploration, as noted in the Additional Considerations section.

Measurement of Model Performance

This study measured model performance primarily by using two distinct accuracy measures:

  • Field-Level-Accuracy (FLA)
  • Character-Based-Accuracy (CBA)

Additional quantities, such as Coverage and Cost, were also measured to provide relevant contextual information. All metrics are described below.


Field-Level-Accuracy (FLA)

This is a binary measurement: if all of the characters of the keys within the predicted JSON match those in the reference JSON, then the FLA is 1; if, however, even one character does not match, then the FLA is 0.

Consider the examples:

JSON1 = {'DATE': '8/28/89', 'CITY': 'Murray', 'STATE': 'KY', 'ZIP': '42171'}
JSON2 = {'DATE': '8/28/89', 'CITY': 'Murray', 'STATE': 'KY', 'ZIP': '42071'}

Comparing JSON1 and JSON2 using FLA results in a score of 0 due to the ZIP mismatch. However, comparing JSON1 with itself yields an FLA score of 1.
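Since FLA is all-or-nothing, it reduces to an exact dictionary comparison:

```python
def field_level_accuracy(pred: dict, ref: dict) -> int:
    """FLA: 1 if every value matches the reference exactly, otherwise 0."""
    return int(pred == ref)

JSON1 = {'DATE': '8/28/89', 'CITY': 'Murray', 'STATE': 'KY', 'ZIP': '42171'}
JSON2 = {'DATE': '8/28/89', 'CITY': 'Murray', 'STATE': 'KY', 'ZIP': '42071'}
```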


Character-Based-Accuracy (CBA)

This accuracy measure is computed as follows:

  1. Determining the Levenshtein edit distance for each corresponding value pair.
  2. Obtaining a normalised score by summing up all of the distances and dividing by the total combined string length of the values.
  3. Converting this score into a percentage.

The Levenshtein edit distance between two strings is the number of edits needed to transform one string into the other. This involves counting substitutions, insertions, or deletions. For example, transforming “marry” into “Murray” would require two substitutions and one insertion, resulting in a total of three edits. These edits can be made in various sequences, but at least three operations are necessary. For this computation, we employed the edit_distance function from the NLTK library.
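For illustration, here is a minimal dynamic-programming implementation of the Levenshtein distance (the study itself uses NLTK's edit_distance); note that the comparison is case-sensitive, so the m/M mismatch counts as a substitution:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (or match)
        prev = curr
    return prev[-1]
```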

Below is a code snippet illustrating the implementation of the described algorithm. This function accepts two JSON inputs and returns an accuracy percentage.

from nltk.metrics import edit_distance

def dict_distance(dict1: dict,
                  dict2: dict) -> float:

    distance_list = []
    character_length = []

    for key, value in dict1.items():
        # Levenshtein edit distance between the corresponding value pair
        distance_list.append(edit_distance(value, dict2[key]))
        # Normalise each pair by the longer of the two value strings
        if len(dict1[key]) > len(dict2[key]):
            character_length.append(len(dict1[key]))
        else:
            character_length.append(len(dict2[key]))

    accuracy = 100 - sum(distance_list)/(sum(character_length))*100

    return accuracy

To better understand the function, let us see how it performs in the following examples:

JSON1 = {'DATE': '8/28/89', 'CITY': 'Murray', 'STATE': 'KY', 'ZIP': '42171'}
JSON2 = {'DATE': 'None', 'CITY': 'None', 'STATE': 'None', 'ZIP': 'None'}
JSON3 = {'DATE': '8/28/89', 'CITY': 'Murray', 'STATE': 'None', 'ZIP': 'None'}
  1. dict_distance(JSON1, JSON1): 100%. There is no difference between JSON1 and JSON1, so we obtain a perfect score of 100%.
  2. dict_distance(JSON1, JSON2): 0%. Every character in JSON2 would need alteration to match JSON1, yielding a 0% score.
  3. dict_distance(JSON1, JSON3): 59%. Every character in the STATE and ZIP keys of JSON3 must be changed to match JSON1, which results in an accuracy score of 59%.

We will now focus on the average value of CBA over the analysed image sample. Both of these accuracy measurements are very strict, since they measure whether all characters and character cases from the examined strings match. FLA is particularly conservative due to its binary nature, which blinds it towards partially correct cases. Although CBA is less conservative than FLA, it is still considered to be somewhat conservative. CBA is able to identify partially correct instances, but it also considers the text case (upper vs. lower), which may have differing levels of importance depending on whether the focus is to recover the correct content of the text or to preserve the exact form of the written content. Overall, we decided to use these stringent measurements for a more conservative approach, since we prioritised text-extraction correctness over text semantics.


Coverage

This quantity is defined as the fraction of form images whose fields have all been extracted in the output JSON. It is useful for monitoring the overall ability to extract all fields from the forms, independent of their correctness. If Coverage is very low, it flags that certain fields are systematically being left out of the extraction process.
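A minimal sketch of this quantity; the helper name and input shape are our own assumptions:

```python
def coverage(predictions, keys=("DATE", "CITY", "STATE", "ZIP")):
    """Fraction of output JSONs in which every expected field is present."""
    complete = sum(1 for p in predictions if all(k in p for k in keys))
    return complete / len(predictions)
```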


Cost

This is a simple estimate of the cost incurred by applying each method to the entire test dataset. We have not captured the GPU cost for fine-tuning the Donut model.

Results

We assessed the performance of all methods on the test dataset, which included 500 samples. The results of this process are summarised in Table 3.

When using FLA, we observe that the more traditional OCR-based methods, FR and TT, perform similarly with relatively low accuracies (FLA ~37%). While not ideal, this may be due to FLA's stringent requirements. On the other hand, when using the CBA Total, which is the average CBA value when accounting for all JSON keys together, the performances of both TT and FR are far more acceptable, yielding values > 77%. Specifically, TT (CBA Total = 89.34%) outperforms FR by ~15%. This behaviour is then preserved when focusing on the values of CBA measured for the individual form fields, notably in the DATE and CITY categories [Table 3], and when measuring the FLA and CBA Totals over the entire sample of 2099 images (TT: FLA = 40.06%; CBA Total = 86.64%; FR: FLA = 35.64%; CBA Total = 78.57%). While the Cost value for applying these two models is the same, TT is better positioned to extract all of the form fields, with Coverage values approximately 9% higher than those of FR.

Table 3. Performance metric values calculated over the test dataset. CBA Total and CBA key (key = Date, City, State, Zip) are sample average values and account, respectively, for the JSON keys altogether and individually. (Image by the authors)

Quantifying the performance of these more traditional OCR-based models provided us with a benchmark, which we then used to evaluate the advantages of a purely Donut approach versus one in combination with TT and GPT. We begin by using TT as our benchmark.

The benefits of utilising this approach are shown through the improved metrics of the Donut model that was fine-tuned on a sample of 1400 images and their corresponding JSONs. Compared to the TT results, this model's global FLA of 54% and CBA Total of 95.23% constitute a 38% and 6% improvement, respectively. The most significant increase was seen in the FLA, demonstrating that the model can accurately retrieve all form fields for over half of the test sample.

The CBA increase is notable, given the limited number of images used for fine-tuning the model. The Donut model shows clear benefits, as evidenced by the improved overall values in Coverage and the key-based CBA metrics, which increased by between 2% and 24%. Coverage achieved 100%, indicating that the model can extract text from all form fields, which reduces the post-processing work involved in productionising such a model.

Based on this task and dataset, these results illustrate that using a fine-tuned Donut model produces results superior to those produced by an OCR model. Finally, ensembling methods were explored to assess whether additional improvements could be made.

The performance of the ensemble of TT and fine-tuned Donut, powered by gpt-3.5-turbo, shows that improvements are possible if specific metrics, such as FLA, are targeted. All of the metrics for this model (excluding CBA State and Coverage) show an increase, ranging between ~0.2% and ~10%, compared to those of our fine-tuned Donut model. The only performance degradation is seen in CBA State, which decreases by ~3% compared with the value measured for our fine-tuned Donut model. This may be due to the GPT prompt that was used, which could be further fine-tuned to improve this metric. Finally, the Coverage value remains unchanged at 100%.

When compared with the other individual fields, Date extraction (see CBA Date) yielded higher accuracy. This was likely due to the limited variability in the Date field, since all dates originated in 1989.

If the performance requirements are particularly conservative, then the 10% increase in FLA is significant and may merit the higher cost of building and maintaining a more complex infrastructure. This should also take into account the source of variability introduced by LLM prompt modification, which is noted in the Additional Considerations section. However, if the performance requirements are less stringent, then the CBA metric improvements yielded by this ensemble method may not merit the additional cost and effort.

Overall, our study shows that while the individual OCR-based methods — namely FR and TT — have their strengths, the Donut model, fine-tuned on only 1400 samples, easily surpasses their accuracy benchmark. Furthermore, ensembling TT and a fine-tuned Donut model through a gpt-3.5-turbo prompt further increases accuracy as measured by the FLA metric. Further considerations must also be made concerning the fine-tuning process of the Donut model and the GPT prompt, which are explored in the following section.

Additional Considerations

Donut model training

To improve the accuracy of the Donut model, we experimented with three training approaches, each aimed at improving inference accuracy while preventing overfitting to the training data. Table 4 displays a summary of our results.

Table 4. Summary of the Donut model fine-tuning. (Image by the authors)

1. The 30-Epoch Training: We trained the Donut model for 30 epochs using a configuration provided in the Donut GitHub repository. This training session lasted approximately 7 hours and resulted in an FLA of 50.0%. The CBA values for the different categories varied, with CITY reaching a value of 90.55% and ZIP reaching 98.01%. However, we noticed that the model started overfitting after the 19th epoch when we examined the val_metric.

2. The 19-Epoch Training: Based on insights gained during the initial training, we fine-tuned the model for only 19 epochs. Our results showed a significant improvement in FLA, which reached 55.8%. The overall CBA, as well as the key-based CBAs, showed improved accuracy values. Despite these promising metrics, we detected a hint of overfitting, as indicated by the val_metric.

3. The 14-Epoch Training: To further refine our model and curb potential overfitting, we employed the EarlyStopping module from PyTorch Lightning. This approach terminated the training after 14 epochs. This resulted in an FLA of 54.0%, and the CBAs were comparable to, if not better than, those of the 19-epoch training.

When evaluating the outputs of these three training sessions, although the 19-epoch training yielded a slightly better FLA, the CBA metrics of the 14-epoch training were overall superior. Moreover, the val_metric reinforced our apprehension regarding the 19-epoch training, indicating a slight inclination towards overfitting.

In conclusion, we deduced that the model fine-tuned over 14 epochs using EarlyStopping was both the most robust and the most cost-efficient.

Prompt engineering variability

We worked on two prompt-engineering approaches (ver1 & ver2) to improve data-extraction accuracy by ensembling a fine-tuned Donut model and our results from TT. After training the model for 14 epochs, Prompt ver1 yielded superior results, with an FLA of 59.6% and higher CBA metrics for all keys [Table 5]. In contrast, Prompt ver2 experienced a decline, with its FLA dropping to 54.4%. A detailed look at the CBA metrics indicated that the accuracy scores for every category in ver2 were slightly lower compared with those of ver1, highlighting the significant difference this alteration made.

Table 5. Summary of results: extracting handwritten text from forms. (Image by the authors)

During our manual labelling process for the dataset, we utilised the results of TT and FR, and developed Prompt ver1 while annotating the text from the forms. Despite being intrinsically similar to its predecessor, Prompt ver2 was slightly modified. Our main goal was to refine the prompt by eliminating empty lines and redundant spaces that were present in Prompt ver1.
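The kind of tidy-up that distinguished ver2 from ver1 can be sketched generically (this is illustrative; the actual prompts are in the linked notebooks):

```python
import re

def tidy_prompt(prompt: str) -> str:
    """Collapse runs of spaces and drop blank lines from a prompt string."""
    lines = [re.sub(r" {2,}", " ", line).strip() for line in prompt.splitlines()]
    return "\n".join(line for line in lines if line)
```

Even a whitespace-only change like this altered the model's outputs, which is the variability discussed below.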

In summary, our experimentation highlighted the nuanced impact of seemingly minor adjustments. While Prompt ver1 showcased higher accuracy, the process of refining and simplifying it into Prompt ver2 paradoxically led to a reduction in performance across all metrics. This highlights the intricate nature of prompt engineering and the need for meticulous testing before finalising a prompt for use.

Prompt ver1 is available in this Notebook, and the code for Prompt ver2 can be seen here.

Conclusion

We created a benchmark dataset for text extraction from images of handwritten forms containing four fields (DATE, CITY, STATE, and ZIP). These forms were manually annotated into a JSON format. We used this dataset to assess the performance of the OCR-based models (FR and TT) and a Donut model, which was then fine-tuned using our dataset. Finally, we employed an ensemble model that we built through prompt engineering, utilising an LLM (gpt-3.5-turbo) with the outputs of TT and our fine-tuned Donut model.

We found that TT performed better than FR and used it as a benchmark to evaluate the potential improvements that could be generated by a Donut model in isolation or in combination with TT and GPT, i.e. the ensemble approach. As displayed by the model performance metrics, this fine-tuned Donut model showed clear accuracy improvements that justify its adoption over OCR-based models. The ensemble model displayed a significant improvement in FLA but comes at a higher cost, and can therefore be considered for use in cases with stricter performance requirements. Despite using the same underlying model, gpt-3.5-turbo, we observed notable variations in the output JSON when minor modifications were made to the prompt. Such unpredictability is a significant downside when using off-the-shelf LLMs in production. We are currently developing a more compact cleaning process based on an open-source LLM to address this issue.

Next Steps

  • The cost column in Table 2 shows that the OpenAI API call was the most expensive cognitive service used in this work. Thus, to minimise costs, we are working on fine-tuning an LLM for a seq2seq task, utilising methods such as full fine-tuning, prompt tuning [5] and QLoRA [6].
  • For privacy reasons, the name box on the images in the dataset is covered by a black rectangle. We are working on updating this by adding random first and last names to the dataset, which would increase the number of data-extraction fields from four to five.
  • In the future, we plan to increase the complexity of the text-extraction task by extending this study to include text extraction from entire forms or other more extensive documents.
  • Investigate Donut model hyperparameter optimisation.
References

  1. Amazon Textract, AWS Textract
  2. Form Recognizer, Form Recognizer (now Document Intelligence)
  3. Kim, Geewook and Hong, Teakgyu and Yim, Moonbin and Nam, JeongYeon and Park, Jinyoung and Yim, Jinyeong and Hwang, Wonseok and Yun, Sangdoo and Han, Dongyoon and Park, Seunghyun, OCR-free Document Understanding Transformer (2022), European Conference on Computer Vision (ECCV)
  4. Grother, P. and Hanaoka, K. (2016) NIST Handwritten Forms and Characters Database (NIST Special Database 19). DOI:
  5. Brian Lester, Rami Al-Rfou, Noah Constant, The Power of Scale for Parameter-Efficient Prompt Tuning (2021), arXiv:2104.08691
  6. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, QLoRA: Efficient Finetuning of Quantized LLMs (2023)

Acknowledgements

We would like to thank our colleague, Dr. David Rodrigues, for his continuous support and the discussions surrounding this project. We would also like to thank Kainos for their support.
