How to Easily Create a Document Classification Machine Learning Function
Document classification is a key machine learning (ML) application that simplifies the organization and management of business documents, like invoices, work orders, and various types of legal contracts. Using machine learning functions like image or text classification, businesses can organize documents based on their content into specific, predefined groups or labels.
You can use different approaches for document classification, including natural language processing (NLP) for classifying text and computer vision for image classification. In this article, we’ll focus on image classification, which involves training models to recognize patterns and features in the images of documents so that businesses can categorize documents by their type. We’ll also show you how to quickly and easily create a custom image classification function to manage your business documents better.
Think a pretrained function could work for your business needs? Try out our pretrained document types classifier. If your use case requires you to sort and label document types not included in the pretrained function, keep reading to learn how to create a custom model.
Why businesses need document classification
Document classification can bypass the tedious, manual sorting and labeling of business documents by automating the process with machine learning. This is especially valuable for organizations dealing with a high volume of diverse document types. These organizations can use image classification to automate the sorting of invoices, contracts, or receipts, improving the efficiency of document routing and helping employees find the documents they need quickly.
1. Enhance workflow efficiency
Document classification can help businesses quickly and accurately route documents to different workflows based on their content. Automating this process ensures that documents go through the appropriate channels for approvals, reviews, or further processing. This automation not only saves time but also minimizes the risk of errors associated with manual routing.
2. Facilitate easy search and discovery
Document classification is also beneficial for search and discovery purposes. With document classification, businesses can assign relevant tags to documents, enabling users to quickly retrieve and locate specific information when needed. This capability becomes increasingly valuable as document databases grow, making information retrieval more efficient.
For instance, imagine a construction materials company that routinely receives photos of documents submitted via its field workers’ smartphones. With document classification, the company could automatically identify when a document is a purchase order and fulfill customers’ orders faster. It could even pair its image classification function with optical character recognition (OCR), which converts images of text into machine-readable text, to determine what material the customer is ordering, such as cement.
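The routing idea described above can be sketched in a few lines of Python. This is only an illustration and not part of Nyckel's SDK: the label names, queue names, and the 0.8 confidence threshold are all hypothetical, and the prediction is assumed to arrive as a plain label string plus a confidence score between 0 and 1.

```python
# Hypothetical mapping from predicted document type to a downstream workflow queue.
ROUTES = {
    "purchase_order": "order_fulfillment",
    "invoice": "accounts_payable",
    "contract": "legal_review",
}


def route_document(label: str, confidence: float, threshold: float = 0.8) -> str:
    """Return the workflow queue for a classified document.

    Unknown labels and low-confidence predictions go to manual review
    instead of being routed automatically.
    """
    if confidence < threshold or label not in ROUTES:
        return "manual_review"
    return ROUTES[label]
```

Sending uncertain predictions to a person rather than a workflow is one simple way to keep automated routing from amplifying the classifier's mistakes.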
Approaches to document classification: Text and image classification
Businesses can use either image classification or text classification to label their documents. Image classification is particularly effective for documents with distinct visual features that can help facilitate classification without the classifier needing to read the text. Text classification, on the other hand, works from the words themselves; when documents arrive as images, OCR first extracts the text so that a text classification model can interpret it.
1. Classify documents with image classification
Image classification excels at differentiating documents with distinct, consistent visual styles, such as invoices that share a recognizable layout. However, it is less effective when the textual content is critical for accurate classification. In those cases, text classification (preceded by OCR if the documents are images) would be the better choice.
A simple way to determine whether image classification is the right choice is to consider if you can categorize a document based on its appearance alone. Image classification is a viable option if visual elements provide enough information for a model to classify the documents accurately.
2. Classify documents with text classification
Text classification becomes necessary when you need to interpret detailed text to classify a document. This approach is invaluable for content requiring deeper textual analysis, such as legal documents. For example, a law firm dealing with contracts, pleadings, and agreements would benefit from text classification to accurately categorize documents based on language nuances and legal terminology.
For documents in image form, OCR is the first step to extract the text. Once extracted, text classification powered by NLP enables the system to analyze and categorize text based on linguistic patterns, keywords, and context. This method is crucial in understanding complex documents, allowing the system to differentiate between various document types.
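To make the OCR-then-classify flow more concrete, here is a deliberately simple sketch. It assumes the OCR step has already produced a plain string, and it stands in for a real NLP model with a keyword count. The `KEYWORDS` table and the `classify_text` function are hypothetical names invented for this illustration; a trained text classifier would learn far richer patterns than this.

```python
import re
from collections import Counter

# Hypothetical keyword lists per document type; a real model would learn these from data.
KEYWORDS = {
    "contract": {"agreement", "party", "parties", "hereby", "term"},
    "pleading": {"plaintiff", "defendant", "court", "motion"},
}


def classify_text(ocr_text: str) -> str:
    """Pick the label whose keywords occur most often in the OCR output."""
    words = Counter(re.findall(r"[a-z]+", ocr_text.lower()))
    scores = {
        label: sum(words[w] for w in keywords)
        for label, keywords in KEYWORDS.items()
    }
    best_label = max(scores, key=scores.get)
    # If no keyword matched at all, report "unknown" rather than guessing.
    return best_label if scores[best_label] > 0 else "unknown"
```

Keyword counting ignores context entirely, which is exactly the gap that NLP-based text classification is meant to close for documents like legal contracts.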
Building a document classifier with Nyckel
Creating an image classification function with Nyckel to classify document types is a lot simpler than it may sound. Let’s consider an example where the visual cues of the documents are enough for a classifier to learn the difference between various document types.
We’ll use the RVL-CDIP dataset, which contains 400,000 grayscale images across 16 document types. For our classifier example, we’ll focus on two document types: advertisements and news articles. However, you could create a Nyckel classifier for all 16 document types. We’ll also train our model on just 500 images per class rather than on all of the images in the training dataset.
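If you want to reproduce the 500-images-per-class subset, one possible approach is sketched below. It assumes the images have already been saved locally with one folder per class; the function name `sample_per_class` and its parameters are hypothetical and not part of any Nyckel tooling.

```python
import os
import random


def sample_per_class(class_dirs, per_class=500, seed=0):
    """Return (filepath, label) pairs with at most `per_class` files per label.

    `class_dirs` maps each label to the directory holding that class's images.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    samples = []
    for label, directory in class_dirs.items():
        filenames = sorted(os.listdir(directory))  # sort first for a stable order
        chosen = rng.sample(filenames, min(per_class, len(filenames)))
        samples.extend((os.path.join(directory, name), label) for name in chosen)
    return samples
```

For the two classes used here, calling `sample_per_class({"advertisement": "advertisement", "news_article": "news_article"})` would yield pairs in the same `(filename, label)` shape as the `training_data` list built in the SDK example later in this article.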
With Nyckel, you can use the Nyckel Python SDK or work entirely through Nyckel’s interface. See how I built the document classification function with Nyckel’s web interface:
Image classification with the Nyckel Python SDK
In addition to the user-friendly web interface I used above, Nyckel makes it simple to train an image classification model using the Python SDK. You could even use the SDK in conjunction with Nyckel’s UI. Here’s how to do that:
1. Use Python code to train your model
Once you’ve downloaded the converted images from the Hugging Face dataset, here is all the code you need to create a Nyckel function:
import os

from nyckel import User, ImageClassificationFunction

# Authenticate with your Nyckel API credentials
user = User(client_id="...", client_secret="...")

# Set the paths for the news article and advertisement directories
news_article_dir = 'news_article'
advertisement_dir = 'advertisement'

# Build a list of (image path, label) pairs from the files in each folder
training_data = []
for file in os.listdir(advertisement_dir):
    filename = os.path.join(advertisement_dir, file)
    training_data.append((filename, 'advertisement'))
for file in os.listdir(news_article_dir):
    filename = os.path.join(news_article_dir, file)
    training_data.append((filename, 'news_article'))

# Create the function and upload the labeled samples
func = ImageClassificationFunction.new("IsAdvertisementOrNewsArticle", user)
func.create_samples(training_data)
2. Try out the model
After the model finishes training, we can check its performance and see that it correctly identified 483 out of 500 news articles as “news_article” (96.6%) and 479 out of 500 advertisements as “advertisement” (95.8%).
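The percentages above follow directly from the counts. As a quick check, the short sketch below recomputes them and also derives the combined accuracy across both classes; the variable and function names are just illustrative.

```python
def accuracy(correct: int, total: int) -> float:
    """Share of correct predictions, expressed as a percentage."""
    return 100.0 * correct / total


# Counts reported above: (correct, total) per class.
results = {"news_article": (483, 500), "advertisement": (479, 500)}

per_class = {label: accuracy(c, t) for label, (c, t) in results.items()}
overall = accuracy(
    sum(c for c, _ in results.values()),
    sum(t for _, t in results.values()),
)
```

With these counts, 962 of the 1,000 images are classified correctly, which works out to 96.2% overall.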
You can also test how the model performs when you invoke it with test images. Simply call `func.invoke()` with a list of the images you’d like the model to classify (e.g., `func.invoke(["news_article1.png", "news_article2.png"])`). The model will return its predictions for each image, as well as its confidence for each prediction.
3. Improve your model over time
Once you’ve trained your initial model, you can continuously improve it by adding more data. It’s especially helpful to focus on adding data samples that include features your model has difficulty with, such as news articles that the model mistakes for advertisements and vice versa. When you add new annotated data to the platform, Nyckel will automatically retrain a new model.
Additionally, you can use Nyckel’s invoke capture feature to annotate captured data periodically. This feature automatically gathers random data and instances with low-confidence predictions from the model’s invokes and places them in a queue for your review. You can then annotate this data to retrain the model, making the process of refining and fine-tuning your model quicker and more intuitive. This iterative process ensures that your classifier evolves and adapts to the nuances of the data it encounters, ultimately improving its accuracy and reliability.
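To illustrate the general idea of queueing uncertain predictions for review, here is a rough sketch. It is not Nyckel's invoke capture implementation, whose internals are not described here. The function `select_for_review`, the dictionary keys, the 0.7 threshold, and the 10% random sampling rate are all assumptions made for this example.

```python
import random


def select_for_review(predictions, threshold=0.7, random_fraction=0.1, seed=0):
    """Pick predictions worth a human look.

    `predictions` is a list of dicts with the hypothetical keys
    "image", "label", and "confidence" (a score from 0 to 1).
    Returns every low-confidence item plus a random sample of the rest.
    """
    rng = random.Random(seed)
    low = [p for p in predictions if p["confidence"] < threshold]
    rest = [p for p in predictions if p["confidence"] >= threshold]
    k = int(len(rest) * random_fraction)  # number of confident items to spot-check
    return low + rng.sample(rest, k)
```

Reviewing a random slice of confident predictions alongside the uncertain ones helps surface cases where the model is confidently wrong, which a confidence threshold alone would miss.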
Easily classify your documents with Nyckel
Interested in how Nyckel can address your document classification needs? Sign up for a free account or reach out to us at any time for support along the way. You can also check out Nyckel’s pretrained function, Document Types Classifier, an image classifier trained on the full RVL-CDIP dataset.
To further expand your toolkit with Nyckel beyond image classification, check out our Text Classification Quick Start, which demonstrates how to categorize text into desired categories, and our OCR Quick Start, which shows you how to extract text from images without performing any prior training.