
1 OCR text conversion of digitized materials in FY2021

In FY2021, the National Diet Library, Japan (NDL) used funding allocated from the Third Supplementary Budget for FY2020 to carry out a project in which LINE Corporation was contracted to convert almost all of the NDL’s roughly 2.47 million digitized materials into text data using optical character recognition (OCR).

1. Purpose of the project

Most of the digitized materials from the NDL are made available in image format without text data. Recent progress in OCR processing has made it feasible to create text data from image data and to make the text data available via a full-text search service. Thus, the NDL prioritized a project to convert its digitized materials into text data using OCR and to make them available via a search service.

Nearly half of the digitized materials at the NDL were published before 1945, and we found that the performance of existing Japanese OCR services and software on these older materials is much lower than that for recent publications. Simply put, it is more difficult for the OCR to recognize the out-of-date character forms and formats used in these older materials than the simpler, more readable layouts of modern publications.

Also, securing computers and other equipment necessary for processing more than 200 million digitized images to create full-text data was a major challenge.

We felt it would be necessary to optimize recognition accuracy for a diverse range of materials as well as ensure an appropriate processing speed. Therefore, we outsourced this OCR text conversion project to create text data using an AI-OCR processing program with improved performance.

2. Target materials for OCR text conversion

In this project, approximately 2.47 million books and periodicals captured in roughly 223 million images that were available via the National Diet Library Digital Collections as of December 2020 have been converted to OCR-generated text data. A detailed breakdown is shown in the table below.

| Collection | Approximate number of items | Number of images |
| --- | --- | --- |
| Periodicals | 1,320,000 | 72,462,853 |
| Books | 973,000 | 137,728,493 |
| Doctoral Dissertations | 149,000 | 12,449,873 |
| Official Gazettes | 21,000 | 387,962 |
| TV and radio scripts | 3,000 | 137,138 |
| Maps | 600 | 566 |
| Documents of the Imperial Library | 200 | 27,838 |
| Total | 2,466,300 | 223,194,723 |

3. Outline of the project

First, the contractor created OCR training datasets by analyzing and processing digitized materials provided by the NDL from the Books and Periodicals indicated in section 2. Next, the contractor sought to optimize accuracy and throughput of the OCR processing program using machine learning to analyze the digitized materials provided by the NDL.

After this period of research and development, the NDL examined the improved OCR processing program to confirm that it met both predefined performance criteria described in sections 4.1. and 4.2. below. Once the NDL had approved the program, the OCR text conversion was performed on all of the target materials shown in section 2.

Concurrently with the work described above, a machine learning model was developed to assign layout information that structures the text data into headings, annotations, page numbers, columns, texts, figures, and tables. Using this model, we were able to automatically obtain layout information for books published during or after the 1960s.

For further information, please see the following document:

4. Predefined performance criteria

4.1. Character recognition performance

4.1.1. Target materials for the evaluation of character recognition performance

The target materials comprised books and periodicals for which publication dates were known.

They were divided into the following 33 segments, each of which was assigned a criterion.

  • Books (20 segments) classified by decade of publication from the 1870s to the 1960s and classified as either Science (classes 4 to 6) or Humanities (other classes) per the Nippon Decimal Classification (NDC) system.
  • Periodicals (13 segments) classified by decade of publication from the 1870s to the 1990s.
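The segmentation above can be sketched in a few lines of Python. This is only an illustration, not the NDL's or the contractor's code: the function names, the tuple layout, and the assumption that a book's category is decided by its NDC main class digit are all hypothetical.

```python
# Illustrative sketch only; names and data layout are assumptions.
# NDC main classes treated as Science in the text above (classes 4 to 6).
SCIENCE_CLASSES = {4, 5, 6}


def decade_label(year):
    """Map a publication year to its decade label, e.g. 1923 -> '1920s'."""
    return f"{year // 10 * 10}s"


def book_category(ndc_main_class):
    """Science for NDC classes 4 to 6, Humanities for every other class."""
    return "Science" if ndc_main_class in SCIENCE_CLASSES else "Humanities"


# Books: 1870s to 1960s (10 decades) x 2 categories = 20 segments.
segments = [
    ("Books", decade_label(decade), category)
    for decade in range(1870, 1970, 10)
    for category in ("Humanities", "Science")
]
# Periodicals: 1870s to 1990s (13 decades), no category split = 13 segments.
segments += [
    ("Periodicals", decade_label(decade), None)
    for decade in range(1870, 2000, 10)
]

print(len(segments))
```

Enumerating the segments this way gives 20 book segments plus 13 periodical segments, which matches the 33 segments stated above.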

Of the kanji characters appearing in the 330 images prepared by the NDL for the performance measurement test, 96.9% were included in JIS Level 1 or Level 2 (excluding "々", "仝", "〇", etc.).

4.1.2. Evaluation method

The recognition performance was evaluated by calculating a character-level F-score ($F_{\mathrm{measure}}$) for each image.

The F-score is defined as follows:

$$F_{\mathrm{measure}}=\dfrac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall}+\mathrm{Precision}}$$

where

$$y_{\mathrm{true}}=\{\text{the multiset of characters in the correct character information}\}$$

$$y_{\mathrm{pred}}=\{\text{the multiset of characters in the recognition result}\}$$

$$\mathrm{Precision}=\dfrac{|y_{\mathrm{pred}} \cap y_{\mathrm{true}}|}{|y_{\mathrm{pred}}|}, \quad \mathrm{Recall}=\dfrac{|y_{\mathrm{pred}} \cap y_{\mathrm{true}}|}{|y_{\mathrm{true}}|}$$

Thus, the F-score is between 0 and 1, with values closer to 1 indicating higher recognition performance.
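As a minimal sketch of this definition, the multiset intersection can be computed with Python's `collections.Counter`. The function name `f_measure` and the handling of empty input are assumptions made for illustration. The source does not say how whitespace or line breaks were treated, so the strings are compared exactly as given.

```python
from collections import Counter


def f_measure(y_true, y_pred):
    """Character-level F-score between ground-truth text and OCR output.

    Both texts are treated as multisets of characters, so character order
    is ignored and repeated characters are counted.
    """
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Counter & Counter keeps the minimum count per character,
    # which is the multiset intersection.
    overlap = sum((true_counts & pred_counts).values())
    if overlap == 0:
        # Assumption: score 0 when nothing matches (also avoids dividing by zero).
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(true_counts.values())
    return 2 * recall * precision / (recall + precision)


# Hypothetical example: one of seven characters is misrecognized.
score = f_measure("国立国会図書館", "国立国会図書舘")
print(round(score, 4))  # 0.8571
```

In this example six of the seven characters match, so precision and recall are both 6/7 and the F-score is also 6/7. Because the measure ignores character order, it does not penalize text that is recognized correctly but output in the wrong sequence.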

4.1.3. Recognition performance criteria

For the program to pass, the median F-score of a segment had to be higher than the criterion indicated below in at least 30 of the 33 segments.

| Type | Decade of publication | Category | Criterion |
| --- | --- | --- | --- |
| Books | 1870s | Humanities | 0.63 |
| Books | 1870s | Science | 0.66 |
| Books | 1880s | Humanities | 0.71 |
| Books | 1880s | Science | 0.72 |
| Books | 1890s | Humanities | 0.73 |
| Books | 1890s | Science | 0.73 |
| Books | 1900s | Humanities | 0.80 |
| Books | 1900s | Science | 0.79 |
| Books | 1910s | Humanities | 0.84 |
| Books | 1910s | Science | 0.86 |
| Books | 1920s | Humanities | 0.90 |
| Books | 1920s | Science | 0.91 |
| Books | 1930s | Humanities | 0.91 |
| Books | 1930s | Science | 0.91 |
| Books | 1940s | Humanities | 0.94 |
| Books | 1940s | Science | 0.92 |
| Books | 1950s | Humanities | 0.95 |
| Books | 1950s | Science | 0.96 |
| Books | 1960s | Humanities | 0.97 |
| Books | 1960s | Science | 0.98 |
| Periodicals | 1870s | - | 0.72 |
| Periodicals | 1880s | - | 0.78 |
| Periodicals | 1890s | - | 0.80 |
| Periodicals | 1900s | - | 0.90 |
| Periodicals | 1910s | - | 0.85 |
| Periodicals | 1920s | - | 0.92 |
| Periodicals | 1930s | - | 0.91 |
| Periodicals | 1940s | - | 0.93 |
| Periodicals | 1950s | - | 0.94 |
| Periodicals | 1960s | - | 0.96 |
| Periodicals | 1970s | - | 0.98 |
| Periodicals | 1980s | - | 0.97 |
| Periodicals | 1990s | - | 0.97 |
| Mean | | | 0.86 |
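The pass rule can be expressed as a short check. The sketch below is an assumption-laden illustration: the function names, the dictionary layout, and the sample F-scores are hypothetical, and only the two criterion values (0.63 and 0.98) are taken from the table.

```python
from statistics import median


def segments_passed(scores_by_segment, criteria):
    """Count segments whose median F-score is strictly above the criterion."""
    return sum(
        1
        for segment, scores in scores_by_segment.items()
        if median(scores) > criteria[segment]
    )


def meets_requirement(scores_by_segment, criteria, required=30):
    """True if enough segments pass (the text requires at least 30 of 33)."""
    return segments_passed(scores_by_segment, criteria) >= required


# Criterion values from the table; the per-image F-scores are made up.
criteria = {
    ("Books", "1870s", "Humanities"): 0.63,
    ("Periodicals", "1970s", None): 0.98,
}
scores_by_segment = {
    ("Books", "1870s", "Humanities"): [0.70, 0.91, 0.88],  # median 0.88
    ("Periodicals", "1970s", None): [0.99, 0.97, 0.96],    # median 0.97
}

print(segments_passed(scores_by_segment, criteria))  # 1
```

Using the median rather than the mean keeps a single badly degraded image from dragging a whole segment below its criterion.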

N.B. Basis for character recognition performance criteria

The criteria for character recognition performance were defined by the following steps: First, we manually created ground-truth data for 330 images, obtained by sampling 10 images from each of the 33 segments. Next, we measured the recognition performance of the following three OCR programs in the same way and adopted the highest F-score in each segment as the criterion for that segment. In other words, the criteria are the combined best performance of these three OCR programs. The OCR programs we used were obtained from the websites of the providers as of October 2020.
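The "combined best performance" rule can be sketched as follows. The program names and scores are invented, and the sketch assumes that each program is summarized by its median F-score per segment before the best one is taken, which the text does not state explicitly.

```python
from statistics import median


def derive_criteria(scores_by_program):
    """Return {segment: highest per-program median F-score}.

    scores_by_program maps a program name to {segment: [F-scores per image]}.
    Summarizing each program by its median is an assumption of this sketch.
    """
    criteria = {}
    for program_scores in scores_by_program.values():
        for segment, scores in program_scores.items():
            criteria[segment] = max(criteria.get(segment, 0.0), median(scores))
    return criteria


# Hypothetical measurements for one segment and three placeholder programs.
scores_by_program = {
    "program_a": {"Books 1870s Humanities": [0.60, 0.63, 0.58]},
    "program_b": {"Books 1870s Humanities": [0.55, 0.61, 0.66]},
    "program_c": {"Books 1870s Humanities": [0.40, 0.52, 0.49]},
}

print(derive_criteria(scores_by_program))
```

Because the best program can differ from segment to segment, the resulting set of criteria may be stricter than any single program could satisfy on its own.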

4.2. Detecting reading direction

4.2.1. Target materials for evaluating detection of reading direction

In principle, all of the materials were evaluated.

4.2.2. Criteria for evaluating detection of reading direction

Reading direction was judged visually on the output result with line breaks removed: if a sentence or phrase could be read in a way that made sense, we considered its reading direction to have been detected correctly. If the reading direction (either vertical or horizontal) of more than half of the character strings in an image was determined to be incorrect, the reading direction of that image was considered incorrect. If 95% or more of the target images were read in the correct direction, the result was considered acceptable.
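The two thresholds in this criterion (more than half of the strings per image, and 95% of the images) can be sketched as below. The function names and the boolean input format are assumptions. In the project the per-string judgment was made visually by a person, so the booleans here stand in for those human decisions.

```python
def image_direction_correct(string_judgments):
    """Judge one image from per-string judgments (True = read correctly).

    The image counts as incorrect only if MORE than half of its character
    strings were judged incorrect, so exactly half wrong still passes.
    """
    wrong = sum(1 for ok in string_judgments if not ok)
    return wrong * 2 <= len(string_judgments)


def direction_acceptable(images, required_percent=95):
    """True if at least required_percent of the images are judged correct.

    Integer arithmetic is used so that exactly 95% is accepted without
    floating-point rounding issues.
    """
    correct = sum(1 for judgments in images if image_direction_correct(judgments))
    return correct * 100 >= required_percent * len(images)


# Hypothetical sample: 19 images read correctly and 1 image mostly wrong.
images = [[True, True, True]] * 19 + [[False, False, True]]
print(direction_acceptable(images))  # True (19 of 20 images is exactly 95%)
```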

5. Results of improvement in character recognition performance

The table below shows the results of the evaluation using a total of 3,630 images, comprising the 330-image dataset prepared by the NDL for the performance measurement test and the 3,300-image dataset prepared by the contractor. (For individual results and details for each dataset, see Final Results of Performance Evaluation (Excel file, 41 KB).)

Actual performance exceeded the criterion in 32 of the 33 segments. The only exception was periodicals published in the 1970s, which fell 0.0079 short of the criterion.

| Type | Decade of publication | Category | Result | Criterion | Difference |
| --- | --- | --- | --- | --- | --- |
| Books | 1870s | Humanities | 0.9147 | 0.63 | +0.2847 |
| Books | 1870s | Science | 0.9013 | 0.66 | +0.2413 |
| Books | 1880s | Humanities | 0.9568 | 0.71 | +0.2468 |
| Books | 1880s | Science | 0.9416 | 0.72 | +0.2216 |
| Books | 1890s | Humanities | 0.9595 | 0.73 | +0.2295 |
| Books | 1890s | Science | 0.9599 | 0.73 | +0.2299 |
| Books | 1900s | Humanities | 0.9651 | 0.80 | +0.1651 |
| Books | 1900s | Science | 0.9645 | 0.79 | +0.1745 |
| Books | 1910s | Humanities | 0.9710 | 0.84 | +0.1310 |
| Books | 1910s | Science | 0.9686 | 0.86 | +0.1086 |
| Books | 1920s | Humanities | 0.9775 | 0.90 | +0.0775 |
| Books | 1920s | Science | 0.9794 | 0.91 | +0.0694 |
| Books | 1930s | Humanities | 0.9765 | 0.91 | +0.0665 |
| Books | 1930s | Science | 0.9776 | 0.91 | +0.0676 |
| Books | 1940s | Humanities | 0.9862 | 0.94 | +0.0462 |
| Books | 1940s | Science | 0.9764 | 0.92 | +0.0564 |
| Books | 1950s | Humanities | 0.9895 | 0.95 | +0.0395 |
| Books | 1950s | Science | 0.9767 | 0.96 | +0.0167 |
| Books | 1960s | Humanities | 0.9908 | 0.97 | +0.0208 |
| Books | 1960s | Science | 0.9838 | 0.98 | +0.0038 |
| Periodicals | 1870s | - | 0.9646 | 0.72 | +0.2446 |
| Periodicals | 1880s | - | 0.9684 | 0.78 | +0.1884 |
| Periodicals | 1890s | - | 0.9721 | 0.80 | +0.1721 |
| Periodicals | 1900s | - | 0.9738 | 0.90 | +0.0738 |
| Periodicals | 1910s | - | 0.9716 | 0.85 | +0.1216 |
| Periodicals | 1920s | - | 0.9757 | 0.92 | +0.0557 |
| Periodicals | 1930s | - | 0.9717 | 0.91 | +0.0617 |
| Periodicals | 1940s | - | 0.9684 | 0.93 | +0.0384 |
| Periodicals | 1950s | - | 0.9702 | 0.94 | +0.0302 |
| Periodicals | 1960s | - | 0.9794 | 0.96 | +0.0194 |
| Periodicals | 1970s | - | 0.9721 | 0.98 | -0.0079 |
| Periodicals | 1980s | - | 0.9807 | 0.97 | +0.0107 |
| Periodicals | 1990s | - | 0.9786 | 0.97 | +0.0086 |
| Mean | | | 0.9686 | 0.86 | +0.1065 |
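The summary figures can be recomputed from the rows of the table. The sketch below copies the published Result and Criterion values. The variable names are illustrative, and the script is a reader-side check, not part of the project's evaluation tooling.

```python
# (type, decade, category, result, criterion), copied from the table above.
ROWS = [
    ("Books", 1870, "Humanities", 0.9147, 0.63),
    ("Books", 1870, "Science", 0.9013, 0.66),
    ("Books", 1880, "Humanities", 0.9568, 0.71),
    ("Books", 1880, "Science", 0.9416, 0.72),
    ("Books", 1890, "Humanities", 0.9595, 0.73),
    ("Books", 1890, "Science", 0.9599, 0.73),
    ("Books", 1900, "Humanities", 0.9651, 0.80),
    ("Books", 1900, "Science", 0.9645, 0.79),
    ("Books", 1910, "Humanities", 0.9710, 0.84),
    ("Books", 1910, "Science", 0.9686, 0.86),
    ("Books", 1920, "Humanities", 0.9775, 0.90),
    ("Books", 1920, "Science", 0.9794, 0.91),
    ("Books", 1930, "Humanities", 0.9765, 0.91),
    ("Books", 1930, "Science", 0.9776, 0.91),
    ("Books", 1940, "Humanities", 0.9862, 0.94),
    ("Books", 1940, "Science", 0.9764, 0.92),
    ("Books", 1950, "Humanities", 0.9895, 0.95),
    ("Books", 1950, "Science", 0.9767, 0.96),
    ("Books", 1960, "Humanities", 0.9908, 0.97),
    ("Books", 1960, "Science", 0.9838, 0.98),
    ("Periodicals", 1870, "-", 0.9646, 0.72),
    ("Periodicals", 1880, "-", 0.9684, 0.78),
    ("Periodicals", 1890, "-", 0.9721, 0.80),
    ("Periodicals", 1900, "-", 0.9738, 0.90),
    ("Periodicals", 1910, "-", 0.9716, 0.85),
    ("Periodicals", 1920, "-", 0.9757, 0.92),
    ("Periodicals", 1930, "-", 0.9717, 0.91),
    ("Periodicals", 1940, "-", 0.9684, 0.93),
    ("Periodicals", 1950, "-", 0.9702, 0.94),
    ("Periodicals", 1960, "-", 0.9794, 0.96),
    ("Periodicals", 1970, "-", 0.9721, 0.98),
    ("Periodicals", 1980, "-", 0.9807, 0.97),
    ("Periodicals", 1990, "-", 0.9786, 0.97),
]

# Segments whose result is strictly above the criterion, and the rest.
exceeded = [row for row in ROWS if row[3] > row[4]]
missed = [row for row in ROWS if row[3] <= row[4]]

mean_result = sum(row[3] for row in ROWS) / len(ROWS)
mean_criterion = sum(row[4] for row in ROWS) / len(ROWS)

print(len(exceeded), [row[:3] for row in missed])
print(round(mean_result, 4), round(mean_criterion, 4))
print(round(mean_result - mean_criterion, 4))
```

Running this reproduces the 32 passing segments and the single exception. It also shows why the mean difference is +0.1065 rather than 0.9686 - 0.86 = 0.1086: the unrounded mean criterion is about 0.8621, and it is displayed rounded to 0.86 in the Mean row.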

6. Information on project output

6.1. Publication of output

The deliverables from the OCR text conversion project are currently available in the following forms:

(1) Character type dataset of the project

This document lists all 23,026 character types obtained in this project, encoded in UTF-8.

(2) OCR training dataset created during performance improvement (in the public domain)

The OCR training dataset created from materials for which copyright protection has expired is publicly available.

(3) Next Digital Library

This is an experimental retrieval system from the NDL that enables full-text search and downloading of full-text data for part of the OCR text data created in this project, namely 280,000 books for which copyright protection has expired.

(4) NDL Ngram Viewer

This is also an experimental service. It visualizes and lists the frequency of occurrence of a search term by publication date, using the same text data of 280,000 books as the Next Digital Library.

6.2. Use of full-text data in the NDL's services

Most of the full-text data for the 2.47 million materials processed in this project will be provided through the full-text search function implemented in the National Diet Library Digital Collections, which is scheduled to be renewed in December 2022.

Furthermore, in FY2022, full-text data will be provided to the visually impaired through the Data Transmission Services for persons with Print Disabilities, except in cases where e-books and other products are commercially available.