2 Development of Japanese OCR software in FY2021

In FY2021, the National Diet Library, Japan, (NDL) utilized funding allocated from the Third Supplementary Budget for FY2020 to implement a project in which Morpho AI Solutions was contracted to develop a machine-learnable optical character recognition (OCR) software for conversion of digitized materials to Japanese text.

1. Purpose of the project

Most of the NDL's digitized materials are made available in image format without text data. Recent progress in OCR processing has made it feasible to create text data from image data and to make the text data available via a full-text search service. Thus, the NDL prioritized a project to convert its digitized materials into text data using OCR and to make them available via a search service.

In 2021, the NDL implemented an OCR text conversion project and acquired full-text data for all materials digitized by 2020. To create full-text data for materials digitized during or after 2021, however, we felt it was necessary to have OCR software that incorporated the latest technology, such as machine learning, was optimized for the NDL's digitized materials, and would be available for use at any time.

This software needed to deliver high recognition accuracy across a diverse range of digitized materials at an appropriate processing speed. We therefore outsourced the research and development of OCR software that would meet these needs and could also be made available as open source.

2. Materials used

In this project, development was based on approximately 2.29 million books and periodicals captured in about 210 million images that were available via the National Diet Library Digital Collections as of December 2020.

Collection	Number of items (rounded)	Number of images
Periodicals	1,320,000	72,462,853
Books	973,000	137,728,493
Total	2,293,000	210,191,346

3. Outline of the project

First, the contractor created OCR training datasets by analyzing and processing digitized materials provided by the NDL from the Books and Periodicals indicated in section 2. Next, we conducted research and development of layout recognition technology to enable high-quality text conversion optimized for digitized materials at the NDL as well as structured handling of the output text data.

During the latter half of the R&D period, we evaluated the performance of the developed OCR software and were able to confirm that both of the performance criteria indicated in sections 4.1. and 4.2. as well as the processing speed requirements for the NDL server environment specified in section 4.3. had been met.

Concurrently with the work described above, we developed a machine learning model that automatically assigns layout information for structuring headings, author names, ruby, annotations, page numbers, columns, text, figures, and tables in the text data. The layout information is provided by means of layout recognition performed during OCR processing.

For further information, please see the following document:

Final Report (excerpts) (PDF 4.6MB) (in Japanese)

4. Predefined performance criteria

4.1. Performance in character recognition

4.1.1. Target materials for the evaluation of character recognition performance

The target materials comprised books and periodicals for which publication dates were known.

They were divided into the following 33 segments, each of which was assigned a criterion.

Books (20 segments) classified by decade of publication from the 1870s to the 1960s and classified as either Science (classes 4 to 6) or Humanities (other classes) per the Nippon Decimal Classification (NDC) system.
Periodicals (13 segments) classified by decade of publication from the 1870s to the 1990s.

Of the JIS kanji characters appearing in the 330 images that the NDL prepared for the performance measurement test, 96.9% belong to JIS Level 1 or Level 2 (excluding characters such as "々", "仝", and "〇").

4.1.2. Evaluation Method

The recognition performance was evaluated by calculating an F-score for each image based on character units.

The F-score is defined as follows:

$$F_{measure}=\dfrac{2 \cdot Precision \cdot Recall}{Precision+Recall}$$

where

$$y_{true}=\{\text{the multiset of characters in the correct character information}\}$$

$$y_{pred}=\{\text{the multiset of characters in the recognition result}\}$$

$$Precision=\dfrac{|y_{pred} \cap y_{true}|}{|y_{pred}|}, \quad Recall=\dfrac{|y_{pred} \cap y_{true}|}{|y_{true}|}$$

Thus, the F-score is between 0 and 1, with values closer to 1 indicating higher recognition performance.
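
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the project: the function name `char_f_score` is hypothetical, and we assume that no normalization (such as removing whitespace or line breaks) is applied before counting. `collections.Counter` serves as the multiset, and its `&` operator gives the multiset intersection.

```python
from collections import Counter


def char_f_score(truth: str, predicted: str) -> float:
    """Character-level F-score between ground-truth and recognized text."""
    y_true = Counter(truth)
    y_pred = Counter(predicted)
    # Multiset intersection: each character counts min(true, pred) times.
    overlap = sum((y_true & y_pred).values())
    if overlap == 0:
        # Covers empty inputs and texts with no characters in common.
        return 0.0
    precision = overlap / sum(y_pred.values())
    recall = overlap / sum(y_true.values())
    return 2 * precision * recall / (precision + recall)


# One of five characters is misrecognized: precision and recall are
# both 4/5, so the score is approximately 0.8.
score = char_f_score("国立国会図", "国立国全図")
print(round(score, 4))
```

Because the characters are compared as multisets, this measure ignores character order; it reflects which characters were recognized, not where they were placed.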

4.1.3. Recognition performance criteria

The median F-scores were required to be higher than the criteria indicated below in all 30 segments that remain after excluding the 3 reference segments from the 33.

Type	Publication Date	Category	Criteria
Books	1870	Humanities	0.40
Books	1870	Science	0.43
Books	1880	Humanities	0.65
Books	1880	Science	0.54
Books	1890	Humanities	0.73
Books	1890	Science	0.60
Books	1900	Humanities	0.79
Books	1900	Science	0.77
Books	1910	Humanities	0.81
Books	1910	Science	0.83
Books	1920	Humanities	0.89
Books	1920	Science	0.89
Books	1930	Humanities	0.91
Books	1930	Science	0.89
Books	1940	Humanities	0.92
Books	1940	Science	0.90
Books	1950	Humanities	0.94
Books	1950	Science	0.94
Books	1960	Humanities	0.97
Books	1960	Science	0.97
Periodicals	1870	-	0.65
Periodicals	1880	-	0.74
Periodicals	1890	-	0.78
Periodicals	1900	-	0.88
Periodicals	1910	-	0.84
Periodicals	1920	-	0.89
Periodicals	1930	-	0.91
Periodicals	1940	-	0.91
Periodicals	1950	-	0.91
Periodicals	1960	-	0.96
Periodicals	1970	-	(for reference) 0.96
Periodicals	1980	-	(for reference) 0.95
Periodicals	1990	-	(for reference) 0.96
mean	-	-	(including reference criteria) 0.79
mean	-	-	(excluding reference criteria) 0.78
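
As a rough sketch of how a single segment could be checked against its criterion, the snippet below takes the median of the per-image F-scores and compares it with the threshold. The function name `segment_passes` and the ten scores are hypothetical and serve only as an illustration; the criterion 0.89 is the value listed for Books, 1920, in the table.

```python
from statistics import median


def segment_passes(image_f_scores: list[float], criterion: float) -> bool:
    """True if the median per-image F-score is strictly above the criterion."""
    return median(image_f_scores) > criterion


# Hypothetical F-scores for the 10 sampled images of one segment.
scores = [0.91, 0.95, 0.88, 0.97, 0.93, 0.96, 0.90, 0.94, 0.92, 0.98]
passed = segment_passes(scores, criterion=0.89)
print(passed)
```

Using the median rather than the mean keeps one badly degraded image from deciding the result for a whole segment.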

N.B. Basis for character recognition performance criteria

The criteria for character recognition performance were defined by the following steps. First, we manually created ground-truth data for 330 images, obtained by sampling 10 images from each of the 33 segments. Next, we measured the recognition performance of three existing OCR programs in the same way and adopted the median F-score in each segment as its criterion. The OCR programs we used were acquired from the websites of the providers as of October 2020.

4.2. Detecting reading direction

4.2.1. Target materials for evaluating detection of reading direction

In principle, all of the materials were evaluated.

4.2.2. Criteria for evaluating detection of reading direction

We removed line breaks from the output and inspected it visually. If a sentence or phrase could be read in a way that made sense, we considered its reading direction (vertical or horizontal) to have been detected correctly. If the reading direction of more than half of the character strings in an image was determined to be incorrect, the reading direction of that image was considered incorrect. If 95% or more of the target images were read in the correct direction, the result was considered acceptable.
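
The two thresholds in this rule can be sketched as follows. The judgment for each character string is made visually by a person, so the code only tallies those judgments; the function names and the boolean encoding (`True` for a correctly read string) are assumptions made for illustration.

```python
def image_direction_correct(strings_ok: list[bool]) -> bool:
    """An image fails only if more than half of its character strings are wrong."""
    wrong = strings_ok.count(False)
    return wrong * 2 <= len(strings_ok)


def detection_acceptable(images: list[list[bool]], min_percent: int = 95) -> bool:
    """True if at least min_percent of the images were read in the right direction."""
    correct = sum(image_direction_correct(strings) for strings in images)
    # Integer arithmetic avoids floating-point rounding at the 95% boundary.
    return correct * 100 >= min_percent * len(images)


# 19 images read correctly and 1 image with 2 of 3 strings wrong: 95% correct.
sample = [[True, True, True]] * 19 + [[False, False, True]]
print(detection_acceptable(sample))
```

Note that an image in which exactly half of the strings are wrong still counts as correct, because the rule says "more than half".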

4.3. Processing Speed Requirements

The processing time (excluding file input/output) must be less than 2 seconds per page (less than 4 seconds per image) in the NDL server environment shown below.

OS: Ubuntu 18.04 LTS
CPU: Intel(R) Xeon(R) W-3245 CPU @ 3.20GHz, 1 unit
GPU: NVIDIA GeForce RTX 2080 Ti, 1 unit
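
One way to measure processing time while excluding file input/output is to load the images first and time only the recognition call. The sketch below assumes exactly that; `run_ocr` stands in for whatever recognition function is being measured and is not part of NDLOCR's actual interface. The default of two pages per image is an assumption inferred from the pairing of 2 seconds per page with 4 seconds per image.

```python
import time
from typing import Callable, Iterable


def mean_seconds_per_page(
    run_ocr: Callable[[object], object],
    loaded_images: Iterable[object],
    pages_per_image: int = 2,
) -> float:
    """Average OCR time per page, timing only the recognition call."""
    total = 0.0
    count = 0
    for image in loaded_images:  # images are assumed to be in memory already
        start = time.perf_counter()
        run_ocr(image)
        total += time.perf_counter() - start
        count += 1
    if count == 0:
        raise ValueError("no images were processed")
    return total / (count * pages_per_image)


# Placeholder recognizer that does nothing, used only to show the call shape.
per_page = mean_seconds_per_page(lambda image: None, [b"image-1", b"image-2"])
print(per_page < 2.0)
```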

5. Achieved character recognition performance

The table below shows the results of the evaluation using the 330-image dataset prepared by the NDL for the performance measurement test.

Actual performance exceeded the criteria in 32 of the 33 categories. The only category that fell short, periodicals published in the 1990s, is one of the three reference categories and was not required to meet its criterion, so the requirement was met in all 30 categories targeted in the performance evaluation. The NDL performed the evaluation using a Docker container on the library's server (environment details are shown in section 4.3). The processing time per page was about 1.5 seconds.

Type	Publication Date	Category	Result	Criteria	Difference
Books	1870	Humanities	0.9147	0.40	+0.5147
Books	1870	Science	0.9174	0.43	+0.4874
Books	1880	Humanities	0.9450	0.65	+0.2950
Books	1880	Science	0.9492	0.54	+0.4092
Books	1890	Humanities	0.9531	0.73	+0.2231
Books	1890	Science	0.9257	0.60	+0.3257
Books	1900	Humanities	0.9559	0.79	+0.1659
Books	1900	Science	0.9584	0.77	+0.1884
Books	1910	Humanities	0.9520	0.81	+0.1420
Books	1910	Science	0.9380	0.83	+0.1080
Books	1920	Humanities	0.9641	0.89	+0.0741
Books	1920	Science	0.9593	0.89	+0.0693
Books	1930	Humanities	0.9624	0.91	+0.0524
Books	1930	Science	0.9624	0.89	+0.0724
Books	1940	Humanities	0.9647	0.92	+0.0447
Books	1940	Science	0.9607	0.90	+0.0607
Books	1950	Humanities	0.9803	0.94	+0.0403
Books	1950	Science	0.9605	0.94	+0.0205
Books	1960	Humanities	0.9824	0.97	+0.0124
Books	1960	Science	0.9743	0.97	+0.0043
Periodicals	1870	-	0.9290	0.65	+0.2790
Periodicals	1880	-	0.9411	0.74	+0.2011
Periodicals	1890	-	0.9404	0.78	+0.1604
Periodicals	1900	-	0.9525	0.88	+0.0725
Periodicals	1910	-	0.9285	0.84	+0.0885
Periodicals	1920	-	0.9514	0.89	+0.0614
Periodicals	1930	-	0.9534	0.91	+0.0434
Periodicals	1940	-	0.9626	0.91	+0.0526
Periodicals	1950	-	0.9548	0.91	+0.0448
Periodicals	1960	-	0.9846	0.96	+0.0246
Periodicals	1970	-	0.9693	0.96	+0.0093
Periodicals	1980	-	0.9605	0.95	+0.0105
Periodicals	1990	-	0.9376	0.96	-0.0224
Mean	-	-	0.9548	0.79	+0.1648
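
The pass counts stated above can be re-derived from the table with a short script. This is only a cross-check: the rows are copied from the table, the three reference segments are the periodicals from the 1970s to the 1990s, and a segment is counted as passing when its result is strictly higher than its criterion.

```python
# (type, decade, category, result, criterion), copied from the table above.
ROWS = [
    ("Books", 1870, "Humanities", 0.9147, 0.40),
    ("Books", 1870, "Science", 0.9174, 0.43),
    ("Books", 1880, "Humanities", 0.9450, 0.65),
    ("Books", 1880, "Science", 0.9492, 0.54),
    ("Books", 1890, "Humanities", 0.9531, 0.73),
    ("Books", 1890, "Science", 0.9257, 0.60),
    ("Books", 1900, "Humanities", 0.9559, 0.79),
    ("Books", 1900, "Science", 0.9584, 0.77),
    ("Books", 1910, "Humanities", 0.9520, 0.81),
    ("Books", 1910, "Science", 0.9380, 0.83),
    ("Books", 1920, "Humanities", 0.9641, 0.89),
    ("Books", 1920, "Science", 0.9593, 0.89),
    ("Books", 1930, "Humanities", 0.9624, 0.91),
    ("Books", 1930, "Science", 0.9624, 0.89),
    ("Books", 1940, "Humanities", 0.9647, 0.92),
    ("Books", 1940, "Science", 0.9607, 0.90),
    ("Books", 1950, "Humanities", 0.9803, 0.94),
    ("Books", 1950, "Science", 0.9605, 0.94),
    ("Books", 1960, "Humanities", 0.9824, 0.97),
    ("Books", 1960, "Science", 0.9743, 0.97),
    ("Periodicals", 1870, "-", 0.9290, 0.65),
    ("Periodicals", 1880, "-", 0.9411, 0.74),
    ("Periodicals", 1890, "-", 0.9404, 0.78),
    ("Periodicals", 1900, "-", 0.9525, 0.88),
    ("Periodicals", 1910, "-", 0.9285, 0.84),
    ("Periodicals", 1920, "-", 0.9514, 0.89),
    ("Periodicals", 1930, "-", 0.9534, 0.91),
    ("Periodicals", 1940, "-", 0.9626, 0.91),
    ("Periodicals", 1950, "-", 0.9548, 0.91),
    ("Periodicals", 1960, "-", 0.9846, 0.96),
    ("Periodicals", 1970, "-", 0.9693, 0.96),
    ("Periodicals", 1980, "-", 0.9605, 0.95),
    ("Periodicals", 1990, "-", 0.9376, 0.96),
]
# Segments whose criteria were given for reference only.
REFERENCE = {("Periodicals", 1970), ("Periodicals", 1980), ("Periodicals", 1990)}

failed = [(t, d, c) for t, d, c, result, criterion in ROWS if result <= criterion]
targeted = [row for row in ROWS if (row[0], row[1]) not in REFERENCE]
targeted_passed = sum(result > criterion for *_, result, criterion in targeted)

print(len(ROWS) - len(failed), "of", len(ROWS), "segments exceed the criterion")
print(targeted_passed, "of", len(targeted), "targeted segments exceed the criterion")
print("below criterion:", failed)
```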

6. Information on project output

6.1. Publication of output

(1) Japanese OCR software (NDLOCR)

The rights issues that had to be settled before releasing the software as open source have been resolved. The OSS libraries it uses are likewise distributed under permissive licenses (such as MIT, BSD, or Apache 2.0), allowing free secondary use for both commercial and non-commercial purposes. The software is split across seven repositories, one per function, but can be built and used as a Docker container by following the instructions in the repository below.

Repository ndlocr_cli

(2) Character sets covered by NDLOCR

(3) OCR training dataset created through development work (in the public domain)

OCR training dataset

6.2. Use of OCR software in NDL services

The OCR software developed in this project will be used to create full-text data from materials digitized after 2021. The full-text data thus created will be added to the NDL Digital Collections, which is scheduled to be renewed in December 2022, and used as materials for full-text search. We also plan to provide the full-text data through the Data Transmission Services for persons with Print Disabilities.

Furthermore, in FY2022, an R&D project to improve the OCR software (NDLOCR) will be implemented under the Supplementary Budget for FY2021 to provide full-text data to visually impaired persons.
