2 Development of Japanese OCR software in FY2021

In FY2021, the National Diet Library, Japan (NDL) used funding allocated from the Third Supplementary Budget for FY2020 to implement a project in which Morpho AI Solutions was contracted to develop machine-learnable optical character recognition (OCR) software for converting digitized materials into Japanese text.

1. Purpose of the project

Most of the digitized materials from the NDL are made available in image format without text data. Recent progress in OCR processing has made it feasible to create text data from image data and to make the text data available via a full-text search service. Thus, the NDL prioritized a project to convert its digitized materials into text data using OCR and to make them available via a search service.

In 2021, the NDL implemented an OCR text conversion project and acquired full-text data for all materials digitized by 2020. To create full-text data for materials digitized during or after 2021, however, we considered it necessary to have OCR software that incorporated the latest technology, such as machine learning, was optimized for the NDL's digitized materials, and would be available for use at any time.

We also considered it necessary to optimize recognition accuracy for a diverse range of digitized materials while ensuring an appropriate processing speed. We therefore outsourced the research and development of OCR software that would meet these needs and could also be made available as open source.

2. Materials used

In this project, development was based on approximately 2.47 million books and periodicals captured in 223 million images that were available via the National Diet Library Digital Collections as of December 2020.

Collection   Number of items (rounded)  Number of images
Periodicals  1,320,000                  72,462,853
Books        973,000                    137,728,493
Total        2,293,000                  210,191,346

3. Outline of the project

First, the contractor created OCR training datasets by analyzing and processing digitized materials provided by the NDL from the Books and Periodicals indicated in section 2. Next, we conducted research and development of layout recognition technology to enable high-quality text conversion optimized for digitized materials at the NDL as well as structured handling of the output text data.

During the latter half of the R&D period, we evaluated the performance of the developed OCR software and were able to confirm that both of the performance criteria indicated in sections 4.1. and 4.2. as well as the processing speed requirements for the NDL server environment specified in section 4.3. had been met.

Concurrently with the work described above, we developed a machine learning model that automatically assigns layout information for structuring headings, author names, ruby, annotations, page numbers, columns, text, figures, and tables in the text data. The layout information is provided by means of layout recognition performed during OCR processing.

For further information, please see the following document:

4. Predefined performance criteria

4.1. Performance in character recognition

4.1.1. Target materials for the evaluation of character recognition performance

The target materials comprised books and periodicals for which publication dates were known.

They were divided into the following 33 segments, each of which was assigned a criterion.

  • Books (20 segments) classified by decade of publication from the 1870s to the 1960s and classified as either Science (classes 4 to 6) or Humanities (other classes) per the Nippon Decimal Classification (NDC) system.
  • Periodicals (13 segments) classified by decade of publication from the 1870s to the 1990s.
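The segmentation above can be sketched in a few lines. This is only an illustration, not the NDL's actual tooling: the function name `segment_for`, the tuple layout, and the use of the top-level NDC class digit (0 to 9) as input are assumptions made for the example.

```python
from typing import Optional, Tuple

BOOK_DECADES = range(1870, 1970, 10)        # 1870s to 1960s
PERIODICAL_DECADES = range(1870, 2000, 10)  # 1870s to 1990s


def segment_for(material_type: str, year: int,
                ndc_class: Optional[int] = None) -> Tuple[str, int, str]:
    """Map one item to its (type, decade, category) segment."""
    decade = year // 10 * 10
    if material_type == "Books":
        if decade not in BOOK_DECADES:
            raise ValueError("book decade outside the 1870s to 1960s")
        if ndc_class is None or not 0 <= ndc_class <= 9:
            raise ValueError("books need a top-level NDC class from 0 to 9")
        # NDC classes 4 to 6 are Science; every other class is Humanities.
        category = "Science" if 4 <= ndc_class <= 6 else "Humanities"
        return ("Books", decade, category)
    if material_type == "Periodicals":
        if decade not in PERIODICAL_DECADES:
            raise ValueError("periodical decade outside the 1870s to 1990s")
        return ("Periodicals", decade, "-")
    raise ValueError("unknown material type")


# Enumerating every combination reproduces the 20 + 13 segments.
all_segments = (
    [("Books", d, c) for d in BOOK_DECADES for c in ("Humanities", "Science")]
    + [("Periodicals", d, "-") for d in PERIODICAL_DECADES]
)
print(len(all_segments))
```

For example, a book published in 1925 with NDC class 5 falls into the Books, 1920s, Science segment.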

Of the JIS kanji characters appearing in the 330 images prepared by the NDL for the performance measurement test, 96.9% were included in JIS Level 1 and 2 (excluding "々", "仝", "〇", etc.).

4.1.2. Evaluation method

The recognition performance was evaluated by calculating an F-score for each image based on character units.

The F-score is defined as follows:

$$F_{\mathrm{measure}}=\dfrac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall}+\mathrm{Precision}}$$

where

$$y_{\mathrm{true}}=\{\text{the multiset of characters in the correct character information}\}$$

$$y_{\mathrm{pred}}=\{\text{the multiset of characters in the recognition result}\}$$

$$\mathrm{Precision}=\dfrac{|y_{\mathrm{pred}} \cap y_{\mathrm{true}}|}{|y_{\mathrm{pred}}|},\quad \mathrm{Recall}=\dfrac{|y_{\mathrm{pred}} \cap y_{\mathrm{true}}|}{|y_{\mathrm{true}}|}$$

Thus, the F-score is between 0 and 1, with values closer to 1 indicating higher recognition performance.
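As a minimal sketch of this definition, the multiset intersection can be computed with Python's `collections.Counter`. The function name `char_f_score` and the sample strings are assumptions for illustration. The source does not say how whitespace or line breaks are treated, so this sketch simply counts every character it is given.

```python
from collections import Counter


def char_f_score(y_true: str, y_pred: str) -> float:
    """Character-level F-score over multisets of characters."""
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Multiset intersection: per-character minimum of the two counts.
    overlap = sum((true_counts & pred_counts).values())
    if overlap == 0:
        # Covers empty inputs and results with no character in common.
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(true_counts.values())
    return 2 * recall * precision / (recall + precision)


# One of seven characters is misrecognized ("会" read as "合").
score = char_f_score("国立国会図書館", "国立国合図書館")
print(round(score, 4))
```

Because the measure compares multisets, it does not depend on the order of the characters: a result containing the right characters in the wrong order still scores 1.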

4.1.3. Recognition performance criteria

The median F-score was required to be higher than the criterion indicated below in each of the 30 evaluated segments, that is, the 33 segments minus the 3 segments used for reference.

Type         Publication decade  Category    Criterion  Note
Books        1870s               Humanities  0.40
Books        1870s               Science     0.43
Books        1880s               Humanities  0.65
Books        1880s               Science     0.54
Books        1890s               Humanities  0.73
Books        1890s               Science     0.60
Books        1900s               Humanities  0.79
Books        1900s               Science     0.77
Books        1910s               Humanities  0.81
Books        1910s               Science     0.83
Books        1920s               Humanities  0.89
Books        1920s               Science     0.89
Books        1930s               Humanities  0.91
Books        1930s               Science     0.89
Books        1940s               Humanities  0.92
Books        1940s               Science     0.90
Books        1950s               Humanities  0.94
Books        1950s               Science     0.94
Books        1960s               Humanities  0.97
Books        1960s               Science     0.97
Periodicals  1870s               -           0.65
Periodicals  1880s               -           0.74
Periodicals  1890s               -           0.78
Periodicals  1900s               -           0.88
Periodicals  1910s               -           0.84
Periodicals  1920s               -           0.89
Periodicals  1930s               -           0.91
Periodicals  1940s               -           0.91
Periodicals  1950s               -           0.91
Periodicals  1960s               -           0.96
Periodicals  1970s               -           0.96       for reference
Periodicals  1980s               -           0.95       for reference
Periodicals  1990s               -           0.96       for reference
Mean         -                   -           0.79       including reference criteria
Mean         -                   -           0.78       excluding reference criteria
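The pass rule of this section can be sketched as follows. The per-image F-scores are hypothetical values invented for the example; only the two criteria (0.40 and 0.96) come from the table above. The names `segment_passes` and `requirement_met` are likewise assumptions.

```python
from statistics import median


def segment_passes(image_f_scores, criterion):
    """True if the median per-image F-score is strictly higher than the criterion."""
    return median(image_f_scores) > criterion


def requirement_met(segments):
    """All non-reference segments must pass; reference segments are ignored."""
    return all(
        segment_passes(s["f_scores"], s["criterion"])
        for s in segments
        if not s["reference"]
    )


# Hypothetical per-image F-scores (10 images per segment).
segments = [
    {"name": "Books 1870s Humanities", "criterion": 0.40, "reference": False,
     "f_scores": [0.35, 0.52, 0.61, 0.48, 0.70, 0.44, 0.39, 0.58, 0.66, 0.50]},
    {"name": "Periodicals 1990s", "criterion": 0.96, "reference": True,
     "f_scores": [0.90, 0.93, 0.95, 0.97, 0.92, 0.94, 0.96, 0.91, 0.98, 0.93]},
]
overall = requirement_met(segments)
print(overall)
```

Using the median rather than the mean keeps a few badly degraded pages from dominating a segment's score. In this made-up data the reference segment falls short of its criterion, yet the overall requirement is still met because reference segments are not counted.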

N.B. Basis for character recognition performance criteria

The criteria for character recognition performance were defined as follows. First, we manually created ground-truth data for 330 images, obtained by sampling 10 images from each of the 33 segments. Next, we measured the recognition performance of the following three OCR programs in the same way and adopted the median F-score in each category as the criterion. The OCR programs were obtained from the providers' websites as of October 2020.

4.2. Detecting reading direction

4.2.1. Target materials for evaluating detection of reading direction

In principle, all of the materials were evaluated.

4.2.2. Criteria for evaluating detection of reading direction

The reading direction of an image was considered incorrect if, in the output result with line breaks removed, the reading direction (vertical or horizontal) of more than half of the character strings in that image was visually determined to be incorrect. In other words, if a sentence or phrase could be read in a way that made sense, we considered the reading direction to have been detected correctly. If 95% or more of the target images were read in the correct direction, the result was considered acceptable.
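The two thresholds in this rule can be expressed as a short sketch. In the project the per-string judgements were made visually by a person, so the boolean lists below stand in for those manual judgements; the function names and sample data are assumptions for illustration.

```python
def image_direction_correct(string_judgements):
    """string_judgements: one bool per character string, True if its direction looked correct.

    The image counts as incorrect only when MORE than half of its strings are wrong.
    """
    wrong = sum(1 for ok in string_judgements if not ok)
    return wrong * 2 <= len(string_judgements)


def direction_acceptable(images, threshold_percent=95):
    """True if at least threshold_percent of the images have a correct reading direction."""
    if not images:
        raise ValueError("no images to evaluate")
    correct = sum(1 for judgements in images if image_direction_correct(judgements))
    # Integer comparison avoids floating-point rounding at the 95% boundary.
    return correct * 100 >= threshold_percent * len(images)


# Hypothetical judgements: 19 images read correctly, 1 image mostly wrong.
images = [[True, True, True]] * 19 + [[False, False, True]]
print(direction_acceptable(images))
```

Note that an image in which exactly half of the strings are wrong still counts as correct under this reading of "more than half".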

4.3. Processing speed requirements

The processing time (excluding file input/output) must be less than 2 seconds per page (less than 4 seconds per image) in the NDL server environment shown below.

  • OS: Ubuntu 18.04 LTS
  • CPU: Intel(R) Xeon(R) W-3245 CPU @ 3.20GHz, 1 unit
  • GPU: NVIDIA GeForce RTX 2080 Ti, 1 unit
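A check against this requirement might look like the sketch below. It assumes two pages per image, which is inferred from the "2 seconds per page (4 seconds per image)" wording rather than stated outright, and it takes a hypothetical `recognize` callable that operates on images already loaded into memory, so that file input/output is excluded from the timing.

```python
import time

MAX_SECONDS_PER_PAGE = 2.0   # from section 4.3
MAX_SECONDS_PER_IMAGE = 4.0  # from section 4.3


def time_per_image(recognize, images):
    """Average wall-clock seconds per image for images that are already loaded."""
    start = time.perf_counter()
    for image in images:
        recognize(image)
    return (time.perf_counter() - start) / len(images)


def meets_speed_requirement(seconds_per_image, pages_per_image=2):
    """Check both limits; pages_per_image=2 is an assumption of this sketch."""
    seconds_per_page = seconds_per_image / pages_per_image
    return (seconds_per_image < MAX_SECONDS_PER_IMAGE
            and seconds_per_page < MAX_SECONDS_PER_PAGE)


# Stand-in recognizer that does nothing; a real one would run the OCR model.
elapsed = time_per_image(lambda image: None, ["img1", "img2", "img3"])
print(meets_speed_requirement(elapsed))
```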

5. Achieved character recognition performance

The table below shows the results of the evaluation using the 330-image dataset prepared by the NDL for the performance measurement test.

Actual performance exceeded the criteria in 32 of the 33 categories. The only category that fell short, periodicals published in the 1990s, was a reference category that did not have to meet its criterion, so the requirement was met in all 30 categories targeted in the performance evaluation. The NDL performed the evaluation using a Docker container on the library's server (environment details are shown in section 4.3). The processing time per page was about 1.5 seconds.

Type         Publication decade  Category    Result  Criterion  Difference
Books        1870s               Humanities  0.9147  0.40       +0.5147
Books        1870s               Science     0.9174  0.43       +0.4874
Books        1880s               Humanities  0.9450  0.65       +0.2950
Books        1880s               Science     0.9492  0.54       +0.4092
Books        1890s               Humanities  0.9531  0.73       +0.2231
Books        1890s               Science     0.9257  0.60       +0.3257
Books        1900s               Humanities  0.9559  0.79       +0.1659
Books        1900s               Science     0.9584  0.77       +0.1884
Books        1910s               Humanities  0.9520  0.81       +0.1420
Books        1910s               Science     0.9380  0.83       +0.1080
Books        1920s               Humanities  0.9641  0.89       +0.0741
Books        1920s               Science     0.9593  0.89       +0.0693
Books        1930s               Humanities  0.9624  0.91       +0.0524
Books        1930s               Science     0.9624  0.89       +0.0724
Books        1940s               Humanities  0.9647  0.92       +0.0447
Books        1940s               Science     0.9607  0.90       +0.0607
Books        1950s               Humanities  0.9803  0.94       +0.0403
Books        1950s               Science     0.9605  0.94       +0.0205
Books        1960s               Humanities  0.9824  0.97       +0.0124
Books        1960s               Science     0.9743  0.97       +0.0043
Periodicals  1870s               -           0.9290  0.65       +0.2790
Periodicals  1880s               -           0.9411  0.74       +0.2011
Periodicals  1890s               -           0.9404  0.78       +0.1604
Periodicals  1900s               -           0.9525  0.88       +0.0725
Periodicals  1910s               -           0.9285  0.84       +0.0885
Periodicals  1920s               -           0.9514  0.89       +0.0614
Periodicals  1930s               -           0.9534  0.91       +0.0434
Periodicals  1940s               -           0.9626  0.91       +0.0526
Periodicals  1950s               -           0.9548  0.91       +0.0448
Periodicals  1960s               -           0.9846  0.96       +0.0246
Periodicals  1970s               -           0.9693  0.96       +0.0093
Periodicals  1980s               -           0.9605  0.95       +0.0105
Periodicals  1990s               -           0.9376  0.96       -0.0224
Mean         -                   -           0.9548  0.79       +0.1648
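The counts quoted in this section can be re-derived from the table. The sketch below copies the Result and Criteria values from the table above and marks the three periodical decades from 1970 to 1990 as reference rows, as stated in section 4.1.3; the tuple layout and variable names are assumptions for illustration.

```python
# (type, decade, category, result, criterion, is_reference)
ROWS = [
    ("Books", 1870, "Humanities", 0.9147, 0.40, False),
    ("Books", 1870, "Science", 0.9174, 0.43, False),
    ("Books", 1880, "Humanities", 0.9450, 0.65, False),
    ("Books", 1880, "Science", 0.9492, 0.54, False),
    ("Books", 1890, "Humanities", 0.9531, 0.73, False),
    ("Books", 1890, "Science", 0.9257, 0.60, False),
    ("Books", 1900, "Humanities", 0.9559, 0.79, False),
    ("Books", 1900, "Science", 0.9584, 0.77, False),
    ("Books", 1910, "Humanities", 0.9520, 0.81, False),
    ("Books", 1910, "Science", 0.9380, 0.83, False),
    ("Books", 1920, "Humanities", 0.9641, 0.89, False),
    ("Books", 1920, "Science", 0.9593, 0.89, False),
    ("Books", 1930, "Humanities", 0.9624, 0.91, False),
    ("Books", 1930, "Science", 0.9624, 0.89, False),
    ("Books", 1940, "Humanities", 0.9647, 0.92, False),
    ("Books", 1940, "Science", 0.9607, 0.90, False),
    ("Books", 1950, "Humanities", 0.9803, 0.94, False),
    ("Books", 1950, "Science", 0.9605, 0.94, False),
    ("Books", 1960, "Humanities", 0.9824, 0.97, False),
    ("Books", 1960, "Science", 0.9743, 0.97, False),
    ("Periodicals", 1870, "-", 0.9290, 0.65, False),
    ("Periodicals", 1880, "-", 0.9411, 0.74, False),
    ("Periodicals", 1890, "-", 0.9404, 0.78, False),
    ("Periodicals", 1900, "-", 0.9525, 0.88, False),
    ("Periodicals", 1910, "-", 0.9285, 0.84, False),
    ("Periodicals", 1920, "-", 0.9514, 0.89, False),
    ("Periodicals", 1930, "-", 0.9534, 0.91, False),
    ("Periodicals", 1940, "-", 0.9626, 0.91, False),
    ("Periodicals", 1950, "-", 0.9548, 0.91, False),
    ("Periodicals", 1960, "-", 0.9846, 0.96, False),
    ("Periodicals", 1970, "-", 0.9693, 0.96, True),
    ("Periodicals", 1980, "-", 0.9605, 0.95, True),
    ("Periodicals", 1990, "-", 0.9376, 0.96, True),
]

# Difference column: result minus criterion, rounded to four decimals.
differences = [round(result - criterion, 4) for (_, _, _, result, criterion, _) in ROWS]

exceeded = [row for row in ROWS if row[3] > row[4]]
evaluated = [row for row in ROWS if not row[5]]
evaluated_passed = [row for row in evaluated if row[3] > row[4]]
missed = [(row[0], row[1]) for row in ROWS if row[3] <= row[4]]

print(len(exceeded), "of", len(ROWS), "categories exceeded the criterion")
print(len(evaluated_passed), "of", len(evaluated), "evaluated categories passed")
print("fell short:", missed)
```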

6. Information on project output

6.1. Publication of output

(1) Japanese OCR software (NDLOCR)

The rights issues that had to be settled in order to release the software as open source have been resolved. The OSS libraries used are also under permissive licenses (such as MIT, BSD, or Apache 2.0), allowing free secondary use for both commercial and non-commercial purposes. The software is divided into seven repositories, one for each functionality, but can be built and used as a Docker container by following the instructions in the repository below.

(2) Character sets covered by NDLOCR

(3) OCR training dataset created through development work (in the public domain)

6.2. Use of OCR software in NDL services

The OCR software developed in this project will be used to create full-text data from materials digitized in or after 2021. The full-text data thus created will be added to the NDL Digital Collections, which is scheduled to be renewed in December 2022, and used as materials for full-text search. We also plan to provide the full-text data through the Data Transmission Services for Persons with Print Disabilities.

Furthermore, in FY2022, an R&D project to improve the OCR software (NDLOCR) will be implemented under the Supplementary Budget for FY2021 to provide full-text data to visually impaired persons.