1 OCR text conversion of digitized materials in FY2021

In FY2021, the National Diet Library, Japan (NDL) used funding from the Third Supplementary Budget for FY2020 to contract LINE Corporation to convert almost all of the NDL's roughly 2.47 million digitized materials into text data using optical character recognition (OCR).

1. Purpose of the project

Most of the digitized materials from the NDL are made available in image format without text data. Recent progress in OCR processing has made it feasible to create text data from image data and to make the text data available via a full-text search service. Thus, the NDL prioritized a project to convert its digitized materials into text data using OCR and to make them available via a search service.

Nearly half of the digitized materials at the NDL were published before 1945, and we found that the performance of existing Japanese OCR services and software on these older materials is much lower than that for recent publications. Simply put, it is more difficult for the OCR to recognize the out-of-date character forms and formats used in these older materials than the simpler, more readable layouts of modern publications.

Also, securing computers and other equipment necessary for processing more than 200 million digitized images to create full-text data was a major challenge.

We felt it would be necessary to optimize recognition accuracy for a diverse range of materials as well as ensure an appropriate processing speed. Therefore, we outsourced this OCR text conversion project to create text data using an AI-OCR processing program with improved performance.

2. Target materials for OCR text conversion

In this project, approximately 2.47 million books and periodicals captured in roughly 223 million images that were available via the National Diet Library Digital Collections as of December 2020 have been converted to OCR-generated text data. A detailed breakdown is shown in the table below.

Collection	Number of materials (rounded)	Number of images
Periodicals	1,320,000	72,462,853
Books	973,000	137,728,493
Doctoral Dissertations	149,000	12,449,873
Official Gazettes	21,000	387,962
TV and radio scripts	3,000	137,138
Maps	600	566
Documents of the Imperial Library	200	27,838
Total	2,466,300	223,194,723

3. Outline of the project

First, the contractor created OCR training datasets by analyzing and processing digitized materials provided by the NDL from the Books and Periodicals indicated in section 2. Next, the contractor sought to optimize accuracy and throughput of the OCR processing program using machine learning to analyze the digitized materials provided by the NDL.

After this period of research and development, the NDL examined the improved OCR processing program to ensure that it met both of the predefined performance criteria described in sections 4.1. and 4.2. below. Once the NDL had approved the program, the OCR text conversion work was performed on all of the target materials shown in section 2.

Concurrently with the work described above, a machine learning model was developed to assign layout information for structuring headings, annotations, page numbers, columns, texts, figures, and tables in the text data. Using this model, we were able to automatically obtain layout information for books published during or after the 1960s.

For further information, please see the following document:

Improvement Result Report (PDF 2.6MB) (in Japanese)

4. Predefined performance criteria

4.1. Character recognition performance

4.1.1. Target materials for the evaluation of character recognition performance

The target materials comprised books and periodicals for which publication dates were known.

They were divided into the following 33 segments, each of which was assigned a criterion.

Books (20 segments): classified by decade of publication from the 1870s to the 1960s, and as either Science (classes 4 to 6) or Humanities (all other classes) per the Nippon Decimal Classification (NDC) system.
Periodicals (13 segments): classified by decade of publication from the 1870s to the 1990s.
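
The segmentation above can be sketched as follows. This is only an illustration: the function name `segment_for`, the tuple layout, and the use of the top-level NDC class digit as input are assumptions made for this example, not part of the project's actual tooling.

```python
from typing import Optional, Tuple

# Decades covered by each material type, as stated in the list above.
BOOK_DECADES = range(1870, 1970, 10)        # 1870s to 1960s: 10 decades
PERIODICAL_DECADES = range(1870, 2000, 10)  # 1870s to 1990s: 13 decades

Segment = Tuple[str, int, Optional[str]]

def segment_for(material_type: str, year: int,
                ndc_class: Optional[int] = None) -> Optional[Segment]:
    """Return the (type, decade, category) segment, or None if out of scope.

    `ndc_class` is assumed to be the top-level NDC class digit (0 to 9).
    """
    decade = year // 10 * 10
    if material_type == "Periodicals":
        if decade not in PERIODICAL_DECADES:
            return None
        return ("Periodicals", decade, None)
    if material_type == "Books":
        if decade not in BOOK_DECADES or ndc_class is None:
            return None
        # NDC classes 4 to 6 count as Science, everything else as Humanities.
        category = "Science" if 4 <= ndc_class <= 6 else "Humanities"
        return ("Books", decade, category)
    return None

# 10 decades x 2 categories for books, plus 13 decades for periodicals.
total_segments = len(BOOK_DECADES) * 2 + len(PERIODICAL_DECADES)
```

Here `total_segments` evaluates to 33, matching the number of segments stated above.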

Of the kanji characters appearing in the 330 images prepared by the NDL for the performance measurement test, 96.9% were included in JIS Level 1 and 2 (excluding "々", "仝", "〇", etc.).

4.1.2. Evaluation method

The recognition performance was evaluated by calculating an F-score ($F_{measure}$) for each image based on character units.

The F-score is defined as follows:

$$F_{measure}=\dfrac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall}+\mathrm{Precision}}$$

where

$$y_{true}=\{\text{the multiset of characters in the correct character information}\}$$

$$y_{pred}=\{\text{the multiset of characters in the recognition result}\}$$

$$\mathrm{Precision}=\dfrac{|y_{pred} \cap y_{true}|}{|y_{pred}|}, \quad \mathrm{Recall}=\dfrac{|y_{pred} \cap y_{true}|}{|y_{true}|}$$

Thus, the F-score is between 0 and 1, with values closer to 1 indicating higher recognition performance.
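
As a minimal sketch of this definition, the multisets can be modelled with Python's `collections.Counter`, whose `&` operator gives the multiset intersection. The function name `char_f_score` is hypothetical, and the sketch assumes the two texts have already been normalized in whatever way the evaluation required (for example, how line breaks and spaces are treated is not specified here).

```python
from collections import Counter

def char_f_score(truth: str, predicted: str) -> float:
    """Character-level F-score between ground-truth text and OCR output."""
    y_true = Counter(truth)      # multiset of ground-truth characters
    y_pred = Counter(predicted)  # multiset of recognized characters
    # Multiset intersection: each character counts min(true, pred) times.
    overlap = sum((y_true & y_pred).values())
    if overlap == 0:
        return 0.0  # also covers empty inputs, avoiding division by zero
    precision = overlap / sum(y_pred.values())
    recall = overlap / sum(y_true.values())
    return 2 * recall * precision / (recall + precision)

# One of seven characters is misrecognized (会 read as 合).
score = char_f_score("国立国会図書館", "国立国合図書館")
```

In this example six of the seven characters match, so precision and recall are both 6/7 and the F-score is 6/7, or about 0.857. Note that a multiset comparison ignores character order, so it measures which characters were recognized rather than whether they appear in the right sequence.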

4.1.3. Recognition performance criteria

The median F-score of each segment was required to be higher than the criterion indicated below in at least 30 of the 33 segments.

Type	Publication decade	Category	Criteria
Books	1870	Humanities	0.63
Books	1870	Science	0.66
Books	1880	Humanities	0.71
Books	1880	Science	0.72
Books	1890	Humanities	0.73
Books	1890	Science	0.73
Books	1900	Humanities	0.80
Books	1900	Science	0.79
Books	1910	Humanities	0.84
Books	1910	Science	0.86
Books	1920	Humanities	0.90
Books	1920	Science	0.91
Books	1930	Humanities	0.91
Books	1930	Science	0.91
Books	1940	Humanities	0.94
Books	1940	Science	0.92
Books	1950	Humanities	0.95
Books	1950	Science	0.96
Books	1960	Humanities	0.97
Books	1960	Science	0.98
Periodicals	1870	-	0.72
Periodicals	1880	-	0.78
Periodicals	1890	-	0.80
Periodicals	1900	-	0.90
Periodicals	1910	-	0.85
Periodicals	1920	-	0.92
Periodicals	1930	-	0.91
Periodicals	1940	-	0.93
Periodicals	1950	-	0.94
Periodicals	1960	-	0.96
Periodicals	1970	-	0.98
Periodicals	1980	-	0.97
Periodicals	1990	-	0.97
mean			0.86
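
The acceptance rule can be sketched as below. The two criteria are taken from the table above, but the per-image scores are hypothetical values invented for this example, and the function names are assumptions rather than the project's actual code.

```python
from statistics import median

def segments_passed(scores_by_segment, criteria):
    """Count the segments whose median F-score is higher than the criterion."""
    return sum(
        1
        for segment, scores in scores_by_segment.items()
        if median(scores) > criteria[segment]
    )

def meets_requirement(scores_by_segment, criteria, required=30):
    """True if enough segments pass (at least 30 of 33 in this project)."""
    return segments_passed(scores_by_segment, criteria) >= required

# Criteria copied from the table above for two of the 33 segments.
criteria = {
    ("Books", 1870, "Humanities"): 0.63,
    ("Periodicals", 1970, None): 0.98,
}
# Hypothetical per-image F-scores, for illustration only.
scores = {
    ("Books", 1870, "Humanities"): [0.90, 0.92, 0.88],
    ("Periodicals", 1970, None): [0.97, 0.98, 0.96],
}
passed = segments_passed(scores, criteria)
```

With these illustrative numbers, the first segment passes (median 0.90 against 0.63) and the second does not (median 0.97 against 0.98), so `passed` is 1.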

N.B. Basis for character recognition performance criteria

The criteria for character recognition performance were defined by the following steps: First, we manually created ground-truth data for 330 images, sampled at 10 images from each of the 33 segments. Next, we measured the recognition performance of three OCR programs in the same way and adopted the highest F-score in each segment as the criterion for that segment. In other words, the criteria are the combined best performance of these three OCR programs. The OCR programs we used were obtained from the websites of the providers as of October 2020.
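
A minimal sketch of this "best of three" rule follows. It assumes that each program's performance in a segment is summarized by the median of its per-image F-scores, which is an interpretation of "in the same way" rather than a confirmed detail. The program names, segment names, and scores are all hypothetical.

```python
from statistics import median

def derive_criteria(f_scores):
    """Take, for each segment, the best median F-score over all programs.

    `f_scores[program][segment]` is a list of per-image F-scores.
    """
    criteria = {}
    for by_segment in f_scores.values():
        for segment, scores in by_segment.items():
            best_so_far = criteria.get(segment, 0.0)
            criteria[segment] = max(best_so_far, median(scores))
    return criteria

# Hypothetical measurements for three programs on two segments.
f_scores = {
    "ocr_a": {"seg1": [0.5, 0.6, 0.7], "seg2": [0.9, 0.9, 0.8]},
    "ocr_b": {"seg1": [0.6, 0.7, 0.8], "seg2": [0.7, 0.8, 0.9]},
    "ocr_c": {"seg1": [0.4, 0.5, 0.6], "seg2": [0.85, 0.95, 0.9]},
}
criteria = derive_criteria(f_scores)
```

Because the best program can differ from segment to segment, the resulting criteria are at least as strict as the results of any single program.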

4.2. Detecting reading direction

4.2.1. Target materials for evaluating detection of reading direction

In principle, all of the materials were evaluated.

4.2.2. Criteria for evaluating detection of reading direction

The output for each image was checked visually with line breaks removed. If the reading direction (either vertical or horizontal) of more than half of the character strings in an image was determined to be incorrect, the reading direction of that image was considered incorrect. In other words, if a sentence or phrase could be read in a way that made sense, we considered the reading direction to have been detected correctly. The result was considered acceptable if 95% or more of the target images were read in the correct direction.
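
The two thresholds can be sketched as follows. In the project the judgment of each character string was made visually, so the boolean values below stand in for those human judgments. The function names are hypothetical.

```python
from typing import Sequence

def image_direction_ok(strings_correct: Sequence[bool]) -> bool:
    """An image fails only when MORE than half of its strings are wrong."""
    wrong = sum(1 for ok in strings_correct if not ok)
    return wrong * 2 <= len(strings_correct)

def direction_acceptable(images: Sequence[Sequence[bool]],
                         min_percent: int = 95) -> bool:
    """True if at least `min_percent` percent of images were read correctly."""
    ok_images = sum(1 for image in images if image_direction_ok(image))
    # Integer arithmetic avoids floating-point issues at the exact threshold.
    return ok_images * 100 >= min_percent * len(images)

# 19 images judged correct and 1 judged incorrect: exactly 95%.
images = [[True, True, False]] * 19 + [[False, False, True]]
acceptable = direction_acceptable(images)
```

An image in which exactly half of the strings are wrong still counts as correct under this reading of "more than half", and a result of exactly 95% is acceptable.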

5. Results of improvement in character recognition performance

The table below shows the results of the evaluation using a total of 3,630 images, comprising the 330-image dataset prepared by the NDL for the performance measurement test and the 3,300-image dataset prepared by the contractor. (For individual results and details for each dataset, see Final Results of Performance Evaluation (Excel file 41KB).)

Actual performance exceeded the criteria in 32 of the 33 segments. The only exception was periodicals published in the 1970s.

Type	Publication decade	Category	Result	Criteria	Difference
Books	1870	Humanities	0.9147	0.63	+0.2847
Books	1870	Science	0.9013	0.66	+0.2413
Books	1880	Humanities	0.9568	0.71	+0.2468
Books	1880	Science	0.9416	0.72	+0.2216
Books	1890	Humanities	0.9595	0.73	+0.2295
Books	1890	Science	0.9599	0.73	+0.2299
Books	1900	Humanities	0.9651	0.80	+0.1651
Books	1900	Science	0.9645	0.79	+0.1745
Books	1910	Humanities	0.9710	0.84	+0.1310
Books	1910	Science	0.9686	0.86	+0.1086
Books	1920	Humanities	0.9775	0.90	+0.0775
Books	1920	Science	0.9794	0.91	+0.0694
Books	1930	Humanities	0.9765	0.91	+0.0665
Books	1930	Science	0.9776	0.91	+0.0676
Books	1940	Humanities	0.9862	0.94	+0.0462
Books	1940	Science	0.9764	0.92	+0.0564
Books	1950	Humanities	0.9895	0.95	+0.0395
Books	1950	Science	0.9767	0.96	+0.0167
Books	1960	Humanities	0.9908	0.97	+0.0208
Books	1960	Science	0.9838	0.98	+0.0038
Periodicals	1870	-	0.9646	0.72	+0.2446
Periodicals	1880	-	0.9684	0.78	+0.1884
Periodicals	1890	-	0.9721	0.80	+0.1721
Periodicals	1900	-	0.9738	0.90	+0.0738
Periodicals	1910	-	0.9716	0.85	+0.1216
Periodicals	1920	-	0.9757	0.92	+0.0557
Periodicals	1930	-	0.9717	0.91	+0.0617
Periodicals	1940	-	0.9684	0.93	+0.0384
Periodicals	1950	-	0.9702	0.94	+0.0302
Periodicals	1960	-	0.9794	0.96	+0.0194
Periodicals	1970	-	0.9721	0.98	-0.0079
Periodicals	1980	-	0.9807	0.97	+0.0107
Periodicals	1990	-	0.9786	0.97	+0.0086
mean			0.9686	0.86	+0.1065
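
The Difference column and the mean row can be recomputed from the table with a short script. The values below are copied from the table above. The variable names are chosen only for this example.

```python
# (type, decade, category, result, criterion), copied from the table above.
ROWS = [
    ("Books", 1870, "Humanities", 0.9147, 0.63),
    ("Books", 1870, "Science", 0.9013, 0.66),
    ("Books", 1880, "Humanities", 0.9568, 0.71),
    ("Books", 1880, "Science", 0.9416, 0.72),
    ("Books", 1890, "Humanities", 0.9595, 0.73),
    ("Books", 1890, "Science", 0.9599, 0.73),
    ("Books", 1900, "Humanities", 0.9651, 0.80),
    ("Books", 1900, "Science", 0.9645, 0.79),
    ("Books", 1910, "Humanities", 0.9710, 0.84),
    ("Books", 1910, "Science", 0.9686, 0.86),
    ("Books", 1920, "Humanities", 0.9775, 0.90),
    ("Books", 1920, "Science", 0.9794, 0.91),
    ("Books", 1930, "Humanities", 0.9765, 0.91),
    ("Books", 1930, "Science", 0.9776, 0.91),
    ("Books", 1940, "Humanities", 0.9862, 0.94),
    ("Books", 1940, "Science", 0.9764, 0.92),
    ("Books", 1950, "Humanities", 0.9895, 0.95),
    ("Books", 1950, "Science", 0.9767, 0.96),
    ("Books", 1960, "Humanities", 0.9908, 0.97),
    ("Books", 1960, "Science", 0.9838, 0.98),
    ("Periodicals", 1870, None, 0.9646, 0.72),
    ("Periodicals", 1880, None, 0.9684, 0.78),
    ("Periodicals", 1890, None, 0.9721, 0.80),
    ("Periodicals", 1900, None, 0.9738, 0.90),
    ("Periodicals", 1910, None, 0.9716, 0.85),
    ("Periodicals", 1920, None, 0.9757, 0.92),
    ("Periodicals", 1930, None, 0.9717, 0.91),
    ("Periodicals", 1940, None, 0.9684, 0.93),
    ("Periodicals", 1950, None, 0.9702, 0.94),
    ("Periodicals", 1960, None, 0.9794, 0.96),
    ("Periodicals", 1970, None, 0.9721, 0.98),
    ("Periodicals", 1980, None, 0.9807, 0.97),
    ("Periodicals", 1990, None, 0.9786, 0.97),
]

# Per-segment difference between result and criterion, to four decimals.
differences = [round(result - criterion, 4) for *_, result, criterion in ROWS]
# Segments that did not exceed their criterion.
missed = [row[:3] for row in ROWS if row[3] <= row[4]]

mean_result = sum(row[3] for row in ROWS) / len(ROWS)
mean_criterion = sum(row[4] for row in ROWS) / len(ROWS)
mean_difference = mean_result - mean_criterion
```

The only entry in `missed` is periodicals of the 1970s. The mean difference of +0.1065 is the mean of the per-segment differences, which equals the mean result minus the unrounded mean criterion (about 0.8621). This is why it is not simply 0.9686 minus the rounded 0.86.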

6. Information on project output

6.1. Publication of output

The deliverables from the OCR text conversion project are currently available in the following forms:

(1) Character type dataset of the project

This document lists all the character types (23,026) obtained in this project (UTF-8).

Dataset of all character types (text file 64KB)

(2) OCR training dataset created during performance improvement (in the public domain)

The OCR training dataset created from materials for which copyright protection has expired is publicly available.

OCR training dataset (in the public domain)

(3) Next Digital Library

This is an experimental retrieval system from the NDL that enables full-text search and downloading of full-text data for a part of the OCR text data created in this project, namely 280,000 books for which copyright protection has expired.

Next Digital Library

(4) NDL Ngram Viewer

This is also an experimental service. It visualizes and lists the frequency of occurrence of a search term by publication date, using the same text data of 280,000 books as the Next Digital Library.

NDL Ngram Viewer

6.2. Use of full-text data in the NDL's services

Most of the full-text data of 2.47 million materials created in this project will be provided through the full-text search function implemented in the National Diet Library Digital Collections, which is scheduled to be renewed in December 2022.

Furthermore, in FY2022, full-text data will be provided to the visually impaired through the Data Transmission Services for persons with Print Disabilities, except in cases where e-books and other products are commercially available.
