2 Development of Japanese OCR software in FY2021
In FY2021, the National Diet Library, Japan (NDL) used funding allocated from the Third Supplementary Budget for FY2020 to implement a project in which Morpho AI Solutions was contracted to develop machine-learnable optical character recognition (OCR) software for converting digitized materials to Japanese text.
1. Purpose of the project
Most of the NDL's digitized materials are made available in image format without text data. Recent progress in OCR processing has made it feasible to create text data from image data and to make that text data available via a full-text search service. The NDL therefore prioritized a project to convert its digitized materials into text data using OCR and to make them available via a search service.
In 2021, the NDL implemented an OCR text conversion project and acquired full-text data for all materials digitized by 2020. To create full-text data for materials digitized during or after 2021, however, we felt it was necessary to have OCR software that incorporated the latest technology, such as machine learning, was optimized for the NDL's digitized materials, and would be available for use at any time.
We also felt it would be necessary to optimize recognition accuracy for a diverse range of digitized materials as well as to ensure an appropriate processing speed. We therefore outsourced the research and development of OCR software that would meet these needs and could also be made available as open source.
2. Materials used
In this project, development was based on the books and periodicals available via the National Diet Library Digital Collections as of December 2020: approximately 2.29 million items captured in roughly 210 million images, as shown in the table below.
Collection | Number of items (approximate) | Number of images |
---|---|---|
Periodicals | 1,320,000 | 72,462,853 |
Books | 973,000 | 137,728,493 |
Total | 2,293,000 | 210,191,346 |
3. Outline of the project
First, the contractor created OCR training datasets by analyzing and processing digitized materials provided by the NDL from the Books and Periodicals indicated in section 2. Next, we conducted research and development of layout recognition technology to enable high-quality text conversion optimized for digitized materials at the NDL as well as structured handling of the output text data.
During the latter half of the R&D period, we evaluated the performance of the developed OCR software and confirmed that the performance criteria indicated in sections 4.1 and 4.2 as well as the processing speed requirement for the NDL server environment specified in section 4.3 had been met.
Concurrently with the work described above, we also developed a machine learning model that automatically assigns layout information for structuring headings, author names, ruby, annotations, page numbers, columns, text, figures, and tables in the text data. This layout information is produced by layout recognition performed during OCR processing.
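To make the idea of structured layout information concrete, the sketch below shows one possible way such elements could be represented. The class, field names, and coordinate convention are illustrative assumptions only, not the actual NDLOCR output format.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical element types mirroring the layout categories listed above.
LAYOUT_TYPES = {
    "heading", "author", "ruby", "annotation", "page_number",
    "column", "text", "figure", "table",
}

@dataclass
class LayoutElement:
    element_type: str                        # one of LAYOUT_TYPES
    bounding_box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in image pixels
    text: str = ""                           # recognized text; empty for figures

# Example: a heading detected near the top of a page image.
heading = LayoutElement("heading", (120, 80, 980, 140), "第一章 緒論")
print(heading.element_type, heading.text)
```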
4. Predefined performance criteria
4.1. Performance in character recognition
4.1.1. Target materials for the evaluation of character recognition performance
The target materials comprised books and periodicals for which publication dates were known.
They were divided into the following 33 segments, each of which was assigned a criterion.
- Books (20 segments) classified by decade of publication from the 1870s to the 1960s and classified as either Science (classes 4 to 6) or Humanities (other classes) per the Nippon Decimal Classification (NDC) system.
- Periodicals (13 segments) classified by decade of publication from the 1870s to the 1990s.
Of the JIS kanji characters appearing in the 330 images prepared by the NDL for the performance measurement test, 96.9% were included in JIS Levels 1 and 2 (excluding "々", "仝", "〇", etc.).
4.1.2. Evaluation Method
Recognition performance was evaluated by calculating a character-level F-score for each image.
The F-score is defined as follows:
$$F = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where
$$y_{\mathrm{true}} = \text{the multiset of characters in the ground-truth (correct) text}$$
$$y_{\mathrm{pred}} = \text{the multiset of characters in the recognition result}$$
$$\mathrm{Precision} = \dfrac{|y_{\mathrm{pred}} \cap y_{\mathrm{true}}|}{|y_{\mathrm{pred}}|}, \quad \mathrm{Recall} = \dfrac{|y_{\mathrm{pred}} \cap y_{\mathrm{true}}|}{|y_{\mathrm{true}}|}$$
Thus, the F-score is between 0 and 1, with values closer to 1 indicating higher recognition performance.
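As an illustration only, a character-level F-score of this kind can be computed from the two multisets in a few lines of Python. This is a minimal sketch of the formula above, not the evaluation code actually used in the project.

```python
from collections import Counter

def character_fscore(ground_truth: str, prediction: str) -> float:
    """Character-level F-score, treating both strings as multisets of characters."""
    if not ground_truth or not prediction:
        return 0.0
    y_true = Counter(ground_truth)   # multiset of ground-truth characters
    y_pred = Counter(prediction)     # multiset of recognized characters
    overlap = sum((y_true & y_pred).values())  # size of the multiset intersection
    precision = overlap / sum(y_pred.values())
    recall = overlap / sum(y_true.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 10 of the 11 characters match in both directions, so F ≈ 0.91.
print(character_fscore("国立国会図書館デジタル", "国立国会図書舘デジタル"))
```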
4.1.3. Recognition performance criteria
The median F-score was required to exceed the criterion indicated below in each of 30 of the 33 segments; the remaining 3 segments were used for reference only.
Type | Publication decade | Category | Criterion |
---|---|---|---|
Books | 1870 | Humanities | 0.40 |
Books | 1870 | Science | 0.43 |
Books | 1880 | Humanities | 0.65 |
Books | 1880 | Science | 0.54 |
Books | 1890 | Humanities | 0.73 |
Books | 1890 | Science | 0.60 |
Books | 1900 | Humanities | 0.79 |
Books | 1900 | Science | 0.77 |
Books | 1910 | Humanities | 0.81 |
Books | 1910 | Science | 0.83 |
Books | 1920 | Humanities | 0.89 |
Books | 1920 | Science | 0.89 |
Books | 1930 | Humanities | 0.91 |
Books | 1930 | Science | 0.89 |
Books | 1940 | Humanities | 0.92 |
Books | 1940 | Science | 0.90 |
Books | 1950 | Humanities | 0.94 |
Books | 1950 | Science | 0.94 |
Books | 1960 | Humanities | 0.97 |
Books | 1960 | Science | 0.97 |
Periodicals | 1870 | - | 0.65 |
Periodicals | 1880 | - | 0.74 |
Periodicals | 1890 | - | 0.78 |
Periodicals | 1900 | - | 0.88 |
Periodicals | 1910 | - | 0.84 |
Periodicals | 1920 | - | 0.89 |
Periodicals | 1930 | - | 0.91 |
Periodicals | 1940 | - | 0.91 |
Periodicals | 1950 | - | 0.91 |
Periodicals | 1960 | - | 0.96 |
Periodicals | 1970 | - | (for reference) 0.96 |
Periodicals | 1980 | - | (for reference) 0.95 |
Periodicals | 1990 | - | (for reference) 0.96 |
Mean | - | - | 0.79 (including reference criteria) |
Mean | - | - | 0.78 (excluding reference criteria) |
N.B. Basis for character recognition performance criteria
The criteria for character recognition performance were defined in the following steps. First, we manually created ground-truth data for 330 images obtained by sampling 10 images from each of the 33 segments. Next, we measured the recognition performance of three existing OCR programs in the same way and adopted the median F-score in each segment as its criterion. The OCR programs were acquired from their providers' websites as of October 2020.
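Read this way, the derivation amounts to taking a median per segment. The sketch below illustrates that step under the assumption that the per-image F-scores of the baseline OCR programs are pooled per segment; the segment names and scores are invented for the example and are not the measured data.

```python
from statistics import median

# Illustrative per-image F-scores of the baseline OCR programs for two segments.
baseline_scores = {
    "Books/1870s/Humanities": [0.35, 0.38, 0.40, 0.42, 0.44],
    "Periodicals/1870s":      [0.60, 0.63, 0.65, 0.66, 0.70],
}

# Adopt the median per-image F-score of each segment as that segment's criterion.
criteria = {segment: round(median(scores), 2)
            for segment, scores in baseline_scores.items()}
print(criteria)  # {'Books/1870s/Humanities': 0.4, 'Periodicals/1870s': 0.65}
```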
4.2. Detecting reading direction
4.2.1. Target materials for evaluating detection of reading direction
In principle, all of the materials were evaluated.
4.2.2. Criteria for evaluating detection of reading direction
An image's reading direction was considered incorrect if the reading direction (vertical or horizontal) of more than half of the character strings in the output result, with line breaks removed, was visually judged to be incorrect. In making this judgment, a character string that could be read in a way that made sense was considered to have been detected in the correct direction. The result was considered acceptable if 95% or more of the target images were read in the correct direction.
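Expressed as code, the acceptance rule reduces to a simple ratio check. The sketch below assumes the per-image judgments described above have already been made; it is an illustration, not part of the project's evaluation tooling.

```python
def reading_direction_acceptable(per_image_correct, threshold=0.95):
    """Return True if at least `threshold` of the evaluated images were judged
    to have the correct reading direction (the 95% acceptance rule above)."""
    if not per_image_correct:
        return False
    return sum(per_image_correct) / len(per_image_correct) >= threshold

# 96 of 100 images correct -> 96% >= 95%, so the result is acceptable.
print(reading_direction_acceptable([True] * 96 + [False] * 4))
```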
4.3. Processing Speed Requirements
The processing time (excluding file input/output) must be less than 2 seconds per page (less than 4 seconds per image) in the NDL server environment shown below.
- OS: Ubuntu 18.04 LTS
- CPU: Intel(R) Xeon(R) W-3245 @ 3.20GHz × 1
- GPU: NVIDIA GeForce RTX 2080 Ti × 1
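For measurements of this kind, timing only the OCR call on images already loaded into memory is one way to exclude file input/output. The sketch below is a minimal illustration; `run_ocr` is a hypothetical placeholder, not an NDLOCR API.

```python
import time

def seconds_per_image(images, run_ocr):
    """Average wall-clock OCR time per in-memory image, excluding file I/O.

    `run_ocr` stands in for the OCR inference call; it is a placeholder,
    not an actual NDLOCR function.
    """
    start = time.perf_counter()
    for image in images:
        run_ocr(image)  # OCR processing only; loading and saving happen elsewhere
    return (time.perf_counter() - start) / len(images)

# The requirement above corresponds to a return value of less than 4 seconds
# per image (less than 2 seconds per page).
```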
5. Achieved character recognition performance
The table below shows the results of the evaluation using the 330-image dataset prepared by the NDL for the performance measurement test.
Actual performance exceeded the criteria in 32 of the 33 segments. The only segment that fell short, periodicals published in the 1990s, was a reference segment whose criterion did not need to be met, so the requirement was satisfied in all 30 segments targeted in the performance evaluation. The evaluation was performed by the NDL using a Docker container on the library's server (environment details are shown in section 4.3), and the processing time per page was about 1.5 seconds.
Type | Publication decade | Category | Result | Criterion | Difference |
---|---|---|---|---|---|
Books | 1870 | Humanities | 0.9147 | 0.40 | +0.5147 |
Books | 1870 | Science | 0.9174 | 0.43 | +0.4874 |
Books | 1880 | Humanities | 0.9450 | 0.65 | +0.2950 |
Books | 1880 | Science | 0.9492 | 0.54 | +0.4092 |
Books | 1890 | Humanities | 0.9531 | 0.73 | +0.2231 |
Books | 1890 | Science | 0.9257 | 0.60 | +0.3257 |
Books | 1900 | Humanities | 0.9559 | 0.79 | +0.1659 |
Books | 1900 | Science | 0.9584 | 0.77 | +0.1884 |
Books | 1910 | Humanities | 0.9520 | 0.81 | +0.1420 |
Books | 1910 | Science | 0.9380 | 0.83 | +0.1080 |
Books | 1920 | Humanities | 0.9641 | 0.89 | +0.0741 |
Books | 1920 | Science | 0.9593 | 0.89 | +0.0693 |
Books | 1930 | Humanities | 0.9624 | 0.91 | +0.0524 |
Books | 1930 | Science | 0.9624 | 0.89 | +0.0724 |
Books | 1940 | Humanities | 0.9647 | 0.92 | +0.0447 |
Books | 1940 | Science | 0.9607 | 0.90 | +0.0607 |
Books | 1950 | Humanities | 0.9803 | 0.94 | +0.0403 |
Books | 1950 | Science | 0.9605 | 0.94 | +0.0205 |
Books | 1960 | Humanities | 0.9824 | 0.97 | +0.0124 |
Books | 1960 | Science | 0.9743 | 0.97 | +0.0043 |
Periodicals | 1870 | - | 0.9290 | 0.65 | +0.2790 |
Periodicals | 1880 | - | 0.9411 | 0.74 | +0.2011 |
Periodicals | 1890 | - | 0.9404 | 0.78 | +0.1604 |
Periodicals | 1900 | - | 0.9525 | 0.88 | +0.0725 |
Periodicals | 1910 | - | 0.9285 | 0.84 | +0.0885 |
Periodicals | 1920 | - | 0.9514 | 0.89 | +0.0614 |
Periodicals | 1930 | - | 0.9534 | 0.91 | +0.0434 |
Periodicals | 1940 | - | 0.9626 | 0.91 | +0.0526 |
Periodicals | 1950 | - | 0.9548 | 0.91 | +0.0448 |
Periodicals | 1960 | - | 0.9846 | 0.96 | +0.0246 |
Periodicals | 1970 | - | 0.9693 | 0.96 | +0.0093 |
Periodicals | 1980 | - | 0.9605 | 0.95 | +0.0105 |
Periodicals | 1990 | - | 0.9376 | 0.96 | -0.0224 |
Mean | - | - | 0.9548 | 0.79 | +0.1648 |
6. Information on project output
6.1. Publication of output
(1) Japanese OCR software (NDLOCR)
Rights issues necessary to release the software as open source have been resolved. The OSS libraries used are distributed under permissive licenses (such as MIT, BSD, or Apache 2.0), allowing free secondary use for both commercial and non-commercial purposes. The software is divided into seven repositories by functionality, but it can be built and used as a Docker container by following the instructions in the repository below.
(2) Character sets covered by NDLOCR
- Character types supported by NDLOCR (PDF, 418KB) (in Japanese)
- Subsumption list of kanji in the JIS level 3 and 4 kanji sets (text, 18KB)
(3) OCR training dataset created through development work (in the public domain)
6.2. Use of OCR software in NDL services
The OCR software developed in this project will be used to create full-text data from materials digitized during or after 2021. The full-text data thus created will be added to the NDL Digital Collections, which is scheduled to be renewed in December 2022, and used for full-text search. We also plan to provide the full-text data through the Data Transmission Service for Persons with Print Disabilities.
Furthermore, in FY2022, an R&D project to improve the OCR software (NDLOCR) will be implemented under the Supplementary Budget for FY2021, with the aim of providing full-text data to visually impaired persons.