1 OCR text conversion of digitized materials in FY2021
In FY2021, the National Diet Library, Japan (NDL) used funding allocated from the Third Supplementary Budget for FY2020 to implement a project in which LINE Corporation was contracted to convert almost all of the NDL’s roughly 2.47 million digitized materials into text data using optical character recognition (OCR).
1. Purpose of the project
Most of the digitized materials from the NDL are made available in image format without text data. Recent progress in OCR processing has made it feasible to create text data from image data and to make the text data available via a full-text search service. Thus, the NDL prioritized a project to convert its digitized materials into text data using OCR and to make them available via a search service.
Nearly half of the digitized materials at the NDL were published before 1945, and we found that the performance of existing Japanese OCR services and software on these older materials is much lower than that for recent publications. Simply put, it is more difficult for the OCR to recognize the out-of-date character forms and formats used in these older materials than the simpler, more readable layouts of modern publications.
Also, securing computers and other equipment necessary for processing more than 200 million digitized images to create full-text data was a major challenge.
We felt it would be necessary to optimize recognition accuracy for a diverse range of materials as well as ensure an appropriate processing speed. Therefore, we outsourced this OCR text conversion project to create text data using an AI-OCR processing program with improved performance.
2. Target materials for OCR text conversion
In this project, approximately 2.47 million books and periodicals captured in roughly 223 million images that were available via the National Diet Library Digital Collections as of December 2020 have been converted to OCR-generated text data. A detailed breakdown is shown in the table below.
Collection | Number of materials (rounded) | Number of images |
---|---|---|
Periodicals | 1,320,000 | 72,462,853 |
Books | 973,000 | 137,728,493 |
Doctoral Dissertations | 149,000 | 12,449,873 |
Official Gazettes | 21,000 | 387,962 |
TV and radio scripts | 3,000 | 137,138 |
Maps | 600 | 566 |
Documents of the Imperial Library | 200 | 27,838 |
Total | 2,466,300 | 223,194,723 |
3. Outline of the project
First, the contractor created OCR training datasets by analyzing and processing digitized materials provided by the NDL from the Books and Periodicals indicated in section 2. Next, the contractor sought to optimize accuracy and throughput of the OCR processing program using machine learning to analyze the digitized materials provided by the NDL.
After this period of research and development, the NDL examined the improved OCR processing program to ensure that it met both predefined performance criteria described in sections 4.1 and 4.2 below. After approval by the NDL, the OCR text conversion work was performed on all of the target materials shown in section 2.
Concurrently with the work described above, a machine learning model was developed to assign layout information for structuring headings, annotations, page numbers, columns, texts, figures, and tables in the text data. Using this model, we were able to automatically obtain layout information for books published during or after the 1960s.
For further information, please see the following document:
4. Predefined performance criteria
4.1 Character recognition performance
4.1.1. Target materials for the evaluation of character recognition performance
The target materials comprised books and periodicals for which publication dates were known.
They were divided into the following 33 segments, each of which was assigned a criterion.
- Books (20 segments) classified by decade of publication from the 1870s to the 1960s and classified as either Science (classes 4 to 6) or Humanities (other classes) per the Nippon Decimal Classification (NDC) system
- Periodicals (13 segments) classified by decade of publication from the 1870s to the 1990s.
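The segmentation above can be sketched in a few lines of Python. This is only an illustration, not the NDL's or the contractor's code: the function name `segment_for`, the tuple layout, and the assumption that the top-level NDC class (0 to 9) is available as an integer are all hypothetical.

```python
from typing import Optional

BOOK_DECADES = range(1870, 1970, 10)        # 1870s to 1960s: 10 decades
PERIODICAL_DECADES = range(1870, 2000, 10)  # 1870s to 1990s: 13 decades


def segment_for(kind: str, year: int, ndc_class: Optional[int] = None) -> tuple:
    """Return (type, decade, category) for one item.

    `kind` is "book" or "periodical". `ndc_class` is the assumed top-level
    NDC class (0-9) and is only used for books.
    """
    decade = year // 10 * 10
    if kind == "book":
        if decade not in BOOK_DECADES:
            raise ValueError(f"no book segment for year {year}")
        if ndc_class is None:
            raise ValueError("a book needs an NDC class")
        # Classes 4 to 6 count as Science, every other class as Humanities.
        category = "Science" if 4 <= ndc_class <= 6 else "Humanities"
        return ("Books", decade, category)
    if kind == "periodical":
        if decade not in PERIODICAL_DECADES:
            raise ValueError(f"no periodical segment for year {year}")
        return ("Periodicals", decade, "-")
    raise ValueError(f"unknown kind: {kind}")


# Enumerate every segment: 10 decades x 2 categories + 13 decades = 33.
ALL_SEGMENTS = (
    [("Books", d, c) for d in BOOK_DECADES for c in ("Humanities", "Science")]
    + [("Periodicals", d, "-") for d in PERIODICAL_DECADES]
)
```

Under these assumptions, a book published in 1923 with NDC class 5 falls into the segment `("Books", 1920, "Science")`, and `len(ALL_SEGMENTS)` is 33.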
Of the JIS kanji characters appearing in the 330 images prepared by the NDL for the performance measurement test, 96.9% were included in JIS Level 1 and 2 (excluding "々", "仝", "〇", etc.).
4.1.2. Evaluation method
The recognition performance was evaluated by calculating an F-score ($F_{measure}$) for each image based on character units.
The F-score is defined as follows:
$$F_{measure}=\dfrac{2 \cdot Recall \cdot Precision}{Recall+Precision}$$
where
$$y_{true}=\{\text{the multiset of characters in the correct character information}\}$$
$$y_{pred}=\{\text{the multiset of characters in the recognition result}\}$$
$$Precision=\dfrac{|y_{pred} \cap y_{true}|}{|y_{pred}|}, \quad Recall=\dfrac{|y_{pred} \cap y_{true}|}{|y_{true}|}$$
Thus, the F-score is between 0 and 1, with values closer to 1 indicating higher recognition performance.
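As a minimal sketch of this definition, the per-image score can be computed with Python's `collections.Counter`, whose `&` operator is a multiset intersection. The function name `char_f_score` is hypothetical, and the source does not say how whitespace or line breaks were treated, so this sketch simply counts every character in the two strings as given.

```python
from collections import Counter


def char_f_score(y_true: str, y_pred: str) -> float:
    """Character-level F-score over multisets of characters."""
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Multiset intersection: per character, the smaller of the two counts.
    overlap = sum((true_counts & pred_counts).values())
    if overlap == 0:
        return 0.0  # also covers empty inputs and avoids division by zero
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(true_counts.values())
    return 2 * recall * precision / (recall + precision)


# One of seven characters is misrecognized, so precision = recall = 6/7.
score = char_f_score("国立国会図書館", "国立国会圖書館")
```

Because the measure is defined on multisets, it does not depend on the order of characters within an image; reading order is assessed separately in section 4.2.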
4.1.3. Recognition performance criteria
Median F-scores were required to be higher than the criteria indicated below in at least 30 of 33 segments.
Type | Decade of publication | Category | Criterion |
---|---|---|---|
Books | 1870 | Humanities | 0.63 |
Books | 1870 | Science | 0.66 |
Books | 1880 | Humanities | 0.71 |
Books | 1880 | Science | 0.72 |
Books | 1890 | Humanities | 0.73 |
Books | 1890 | Science | 0.73 |
Books | 1900 | Humanities | 0.80 |
Books | 1900 | Science | 0.79 |
Books | 1910 | Humanities | 0.84 |
Books | 1910 | Science | 0.86 |
Books | 1920 | Humanities | 0.90 |
Books | 1920 | Science | 0.91 |
Books | 1930 | Humanities | 0.91 |
Books | 1930 | Science | 0.91 |
Books | 1940 | Humanities | 0.94 |
Books | 1940 | Science | 0.92 |
Books | 1950 | Humanities | 0.95 |
Books | 1950 | Science | 0.96 |
Books | 1960 | Humanities | 0.97 |
Books | 1960 | Science | 0.98 |
Periodicals | 1870 | - | 0.72 |
Periodicals | 1880 | - | 0.78 |
Periodicals | 1890 | - | 0.80 |
Periodicals | 1900 | - | 0.90 |
Periodicals | 1910 | - | 0.85 |
Periodicals | 1920 | - | 0.92 |
Periodicals | 1930 | - | 0.91 |
Periodicals | 1940 | - | 0.93 |
Periodicals | 1950 | - | 0.94 |
Periodicals | 1960 | - | 0.96 |
Periodicals | 1970 | - | 0.98 |
Periodicals | 1980 | - | 0.97 |
Periodicals | 1990 | - | 0.97 |
Mean | - | - | 0.86 |
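The acceptance rule (median F-score higher than the criterion in at least 30 of 33 segments) can be sketched as follows. The function names and the dictionary layout are assumptions made for illustration, and the per-image scores in the example are invented, not measured values.

```python
from statistics import median


def segments_passed(scores_by_segment, criteria):
    """Count segments whose median per-image F-score is higher than the criterion."""
    return sum(
        1
        for segment, scores in scores_by_segment.items()
        if median(scores) > criteria[segment]
    )


def meets_requirement(scores_by_segment, criteria, required=30):
    """True if enough segments pass (30 of 33 in the project)."""
    return segments_passed(scores_by_segment, criteria) >= required


# Three criteria taken from the table above; the scores are invented.
criteria = {
    ("Books", 1870, "Humanities"): 0.63,
    ("Books", 1870, "Science"): 0.66,
    ("Periodicals", 1970, "-"): 0.98,
}
invented_scores = {
    ("Books", 1870, "Humanities"): [0.90, 0.92, 0.91],  # median 0.91
    ("Books", 1870, "Science"): [0.50, 0.70, 0.90],     # median 0.70
    ("Periodicals", 1970, "-"): [0.97, 0.98, 0.96],     # median 0.97
}
passed = segments_passed(invented_scores, criteria)
```

With these invented scores, two of the three segments pass. Using the median rather than the mean keeps a single badly recognized image from deciding the outcome for a segment.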
N.B. Basis for character recognition performance criteria
The criteria for character recognition performance were defined by the following steps: First, we manually created ground-truth data for 330 images, obtained by sampling 10 images from each of the 33 segments. Next, we measured the recognition performance of the following three OCR programs in the same way and adopted the highest F-score in each segment as its criterion. In other words, the criteria represent the combined best performance of these three OCR programs. The OCR programs we used were obtained from the websites of the providers as of October 2020.
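The "combined best performance" rule amounts to a per-segment maximum. The sketch below assumes one summary F-score per program and segment; the program names, segment keys, and values are placeholders invented for the example, since the actual per-program measurements are not given here.

```python
def derive_criteria(f_scores_by_program):
    """For each segment, keep the highest F-score achieved by any program."""
    criteria = {}
    for program_scores in f_scores_by_program.values():
        for segment, score in program_scores.items():
            criteria[segment] = max(score, criteria.get(segment, 0.0))
    return criteria


# Placeholder programs and invented scores, for illustration only.
invented_measurements = {
    "program_a": {"segment_1": 0.60, "segment_2": 0.70},
    "program_b": {"segment_1": 0.64, "segment_2": 0.65},
    "program_c": {"segment_1": 0.58, "segment_2": 0.72},
}
derived = derive_criteria(invented_measurements)
```

Here `derived` maps `segment_1` to 0.64 and `segment_2` to 0.72, so the resulting bar can be higher than what any single program achieves across all segments.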
4.2. Detecting reading direction
4.2.1. Target materials for evaluating detection of reading direction
In principle, all of the materials were evaluated.
4.2.2. Criteria for evaluating detection of reading direction
If the reading direction (either vertical or horizontal) of more than half of the character strings in an image was visually determined to be incorrect in the output result with line breaks removed, the reading direction of that image was considered incorrect. In other words, if a sentence or phrase could be read in a way that made sense, we considered the reading direction to have been detected correctly. If 95% or more of the target images were read in the correct direction, the result was considered acceptable.
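The two thresholds in this rule can be sketched as below. The judgement itself was made visually by people; the sketch only shows how such per-string judgements would roll up into a per-image and an overall result. The function names and the input layout (one boolean per character string) are assumptions.

```python
def image_direction_correct(string_ok):
    """string_ok: one bool per character string, True if its direction is right."""
    wrong = sum(1 for ok in string_ok if not ok)
    # An image is incorrect only when MORE than half of its strings are wrong.
    return wrong * 2 <= len(string_ok)


def batch_acceptable(images, required_percent=95):
    """True if at least `required_percent` of images have the correct direction."""
    if not images:
        raise ValueError("no images to evaluate")
    correct = sum(1 for string_ok in images if image_direction_correct(string_ok))
    # Integer comparison avoids floating-point issues at exactly 95%.
    return correct * 100 >= required_percent * len(images)


good_image = [True, True, False]   # 1 of 3 strings wrong: still correct
bad_image = [False, False, True]   # 2 of 3 strings wrong: incorrect
```

Under this reading, 19 correct images out of 20 (exactly 95%) would be acceptable, while 18 out of 20 would not.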
5. Results of improvement in character recognition performance
The table below shows the results of the evaluation using a total of 3,630 images, comprising the 330-image dataset prepared by the NDL for the performance measurement test and the 3,300-image dataset prepared by the contractor. (For individual results and details for each dataset, see Final Results of Performance Evaluation (Excel file 41KB).)
Of the 33 segments, actual performance exceeded the criteria in 32. The only exception was periodicals published in the 1970s.
Type | Decade of publication | Category | Result | Criterion | Difference |
---|---|---|---|---|---|
Books | 1870 | Humanities | 0.9147 | 0.63 | +0.2847 |
Books | 1870 | Science | 0.9013 | 0.66 | +0.2413 |
Books | 1880 | Humanities | 0.9568 | 0.71 | +0.2468 |
Books | 1880 | Science | 0.9416 | 0.72 | +0.2216 |
Books | 1890 | Humanities | 0.9595 | 0.73 | +0.2295 |
Books | 1890 | Science | 0.9599 | 0.73 | +0.2299 |
Books | 1900 | Humanities | 0.9651 | 0.80 | +0.1651 |
Books | 1900 | Science | 0.9645 | 0.79 | +0.1745 |
Books | 1910 | Humanities | 0.9710 | 0.84 | +0.1310 |
Books | 1910 | Science | 0.9686 | 0.86 | +0.1086 |
Books | 1920 | Humanities | 0.9775 | 0.90 | +0.0775 |
Books | 1920 | Science | 0.9794 | 0.91 | +0.0694 |
Books | 1930 | Humanities | 0.9765 | 0.91 | +0.0665 |
Books | 1930 | Science | 0.9776 | 0.91 | +0.0676 |
Books | 1940 | Humanities | 0.9862 | 0.94 | +0.0462 |
Books | 1940 | Science | 0.9764 | 0.92 | +0.0564 |
Books | 1950 | Humanities | 0.9895 | 0.95 | +0.0395 |
Books | 1950 | Science | 0.9767 | 0.96 | +0.0167 |
Books | 1960 | Humanities | 0.9908 | 0.97 | +0.0208 |
Books | 1960 | Science | 0.9838 | 0.98 | +0.0038 |
Periodicals | 1870 | - | 0.9646 | 0.72 | +0.2446 |
Periodicals | 1880 | - | 0.9684 | 0.78 | +0.1884 |
Periodicals | 1890 | - | 0.9721 | 0.80 | +0.1721 |
Periodicals | 1900 | - | 0.9738 | 0.90 | +0.0738 |
Periodicals | 1910 | - | 0.9716 | 0.85 | +0.1216 |
Periodicals | 1920 | - | 0.9757 | 0.92 | +0.0557 |
Periodicals | 1930 | - | 0.9717 | 0.91 | +0.0617 |
Periodicals | 1940 | - | 0.9684 | 0.93 | +0.0384 |
Periodicals | 1950 | - | 0.9702 | 0.94 | +0.0302 |
Periodicals | 1960 | - | 0.9794 | 0.96 | +0.0194 |
Periodicals | 1970 | - | 0.9721 | 0.98 | -0.0079 |
Periodicals | 1980 | - | 0.9807 | 0.97 | +0.0107 |
Periodicals | 1990 | - | 0.9786 | 0.97 | +0.0086 |
Mean | - | - | 0.9686 | 0.86 | +0.1065 |
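The summary figures can be rechecked from the table itself. The short script below copies the Result and Criterion columns and recomputes the number of segments above their criterion and the three means. It also shows why the mean difference is +0.1065 rather than 0.9686 - 0.86 = 0.1086: it is the mean of the per-segment differences, which equals the difference of the unrounded means (the mean criterion is about 0.8621 before rounding). The variable names are only illustrative.

```python
# (type, decade, category, result, criterion), copied from the table above.
RESULTS = [
    ("Books", 1870, "Humanities", 0.9147, 0.63),
    ("Books", 1870, "Science", 0.9013, 0.66),
    ("Books", 1880, "Humanities", 0.9568, 0.71),
    ("Books", 1880, "Science", 0.9416, 0.72),
    ("Books", 1890, "Humanities", 0.9595, 0.73),
    ("Books", 1890, "Science", 0.9599, 0.73),
    ("Books", 1900, "Humanities", 0.9651, 0.80),
    ("Books", 1900, "Science", 0.9645, 0.79),
    ("Books", 1910, "Humanities", 0.9710, 0.84),
    ("Books", 1910, "Science", 0.9686, 0.86),
    ("Books", 1920, "Humanities", 0.9775, 0.90),
    ("Books", 1920, "Science", 0.9794, 0.91),
    ("Books", 1930, "Humanities", 0.9765, 0.91),
    ("Books", 1930, "Science", 0.9776, 0.91),
    ("Books", 1940, "Humanities", 0.9862, 0.94),
    ("Books", 1940, "Science", 0.9764, 0.92),
    ("Books", 1950, "Humanities", 0.9895, 0.95),
    ("Books", 1950, "Science", 0.9767, 0.96),
    ("Books", 1960, "Humanities", 0.9908, 0.97),
    ("Books", 1960, "Science", 0.9838, 0.98),
    ("Periodicals", 1870, "-", 0.9646, 0.72),
    ("Periodicals", 1880, "-", 0.9684, 0.78),
    ("Periodicals", 1890, "-", 0.9721, 0.80),
    ("Periodicals", 1900, "-", 0.9738, 0.90),
    ("Periodicals", 1910, "-", 0.9716, 0.85),
    ("Periodicals", 1920, "-", 0.9757, 0.92),
    ("Periodicals", 1930, "-", 0.9717, 0.91),
    ("Periodicals", 1940, "-", 0.9684, 0.93),
    ("Periodicals", 1950, "-", 0.9702, 0.94),
    ("Periodicals", 1960, "-", 0.9794, 0.96),
    ("Periodicals", 1970, "-", 0.9721, 0.98),
    ("Periodicals", 1980, "-", 0.9807, 0.97),
    ("Periodicals", 1990, "-", 0.9786, 0.97),
]

# Segments where the result is higher than the criterion, and those where it is not.
exceeded = [row[:3] for row in RESULTS if row[3] > row[4]]
missed = [row[:3] for row in RESULTS if row[3] <= row[4]]

mean_result = sum(row[3] for row in RESULTS) / len(RESULTS)
mean_criterion = sum(row[4] for row in RESULTS) / len(RESULTS)
# Mean of per-segment differences equals the difference of the two means.
mean_difference = mean_result - mean_criterion

print(len(exceeded), missed)
print(round(mean_result, 4), round(mean_criterion, 2), round(mean_difference, 4))
```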
6. Information on project output
6.1. Publication of output
The deliverables from the OCR text conversion project are currently available in the following forms:
(1) Character type dataset of the project
This UTF-8 document lists all 23,026 character types obtained in this project.
(2) OCR training dataset created during performance improvement (in the public domain)
The OCR training dataset created from materials for which copyright protection has expired is publicly available.
(3) Next Digital Library
This is an experimental retrieval system from the NDL that enables full-text search and downloading of full-text data for a part of the OCR text data created in this project, namely 280,000 books for which copyright protection has expired.
(4) NDL Ngram Viewer
This is also an experimental service. It visualizes and lists the frequency of occurrence of a search term by publication date, using the same text data of 280,000 books as the Next Digital Library.
6.2 Use of full-text data in the NDL's services
Most of the full-text data of 2.47 million materials created in this project will be provided through the full-text search function implemented in the National Diet Library Digital Collections, which is scheduled to be renewed in December 2022.
Furthermore, in FY2022, full-text data will be provided to the visually impaired through the Data Transmission Services for persons with Print Disabilities, except in cases where e-books and other products are commercially available.