SUB Alto importer
This importer is a special case of the Mets/Alto importer. Most information is stored only in the ALTO.xml files, so only the Pages are in ALTO format. It was developed to handle OCR newspaper data in the format provided by the Hamburg State Library (SUB).
SUB Custom classes
This module contains the definition of SUB importer classes.
The classes define newspaper Issues and Pages objects which convert OCR data in the SUB version of the Mets/Alto format to a unified canonical format. These classes are subclasses of generic Mets/Alto importer classes.
- class text_preparation.importers.sub.classes.SubNewspaperIssue(issue_dir: IssueDirectory)
Newspaper Issue in SUB (Mets/Alto) format.
All functions defined in this child class are specific to parsing SUB Mets/Alto format.
- Parameters:
issue_dir (SubIssueDir) – Identifying information about the issue.
- id
Canonical Issue ID (e.g. hamb_echo-1888-02-01-a).
- Type:
str
- edition
Lower case letter ordering issues of the same day.
- Type:
str
- alias
Newspaper unique alias (identifier or name).
- Type:
str
- path
Path to directory containing the issue’s OCR data.
- Type:
str
- date
Publication date of issue.
- Type:
datetime.date
- issue_data
Issue data according to canonical format.
- Type:
dict[str, Any]
- pages
List of SubNewspaperPage instances from this issue.
- Type:
list
- ppn
PPN identifier from the METS filename.
- Type:
str
- title_ppn
Title-level PPN identifier (PPN without date).
- Type:
str
- title
Newspaper title extracted from METS metadata.
- Type:
str
- mets_file
Path to the METS XML file for this issue.
- Type:
str
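The canonical issue ID shown above (hamb_echo-1888-02-01-a) appears to combine the alias, the publication date, and the edition letter. The following is a minimal sketch of that composition, not the importer's actual code; the helper name canonical_issue_id is hypothetical.

```python
from datetime import date


def canonical_issue_id(alias: str, pub_date: date, edition: str) -> str:
    """Join alias, ISO date and edition letter (hypothetical helper)."""
    # date.isoformat() yields YYYY-MM-DD, matching the documented example.
    return f"{alias}-{pub_date.isoformat()}-{edition}"


issue_id = canonical_issue_id("hamb_echo", date(1888, 2, 1), "a")
print(issue_id)
```

The inputs correspond to the alias, date and edition attributes listed above.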
- class text_preparation.importers.sub.classes.SubNewspaperPage(_id: str, number: int, filename: str, basedir: str, page_size: tuple[int, int], file_id: str, iiif_img_base_uri: str | None = None, encoding: str = 'utf-8')
Newspaper page in SUB (Mets/Alto) format.
- Parameters:
_id (str) – Canonical page ID.
number (int) – Page number.
filename (str) – Name of the Alto XML page file.
basedir (str) – Base directory where Alto files are located.
page_size (tuple[int, int]) – Width and height of the page image.
encoding (str, optional) – Encoding of XML file. Defaults to ‘utf-8’.
- id
Canonical Page ID (e.g. hamb_echo-1888-02-01-a-p0001).
- Type:
str
- number
Page number.
- Type:
int
- page_data
Page data according to canonical format.
- Type:
dict[str, Any]
- issue
Issue this page is from.
- Type:
SubNewspaperIssue
- filename
Name of the Alto XML page file.
- Type:
str
- basedir
Base directory where Alto files are located.
- Type:
str
- encoding
Encoding of XML file.
- Type:
str
- add_issue(issue: SubNewspaperIssue) → None
Add the given SubNewspaperIssue as an attribute for this class.
- Parameters:
issue (SubNewspaperIssue) – Issue this page is from
- parse() → None
Parse the page’s Alto XML and extract regions, paragraphs, lines, and tokens.
This method processes the SUB Alto XML document to extract all OCR information and structure it into the canonical page format. It maps OCR component IDs to Content Item IDs and extracts page regions with their coordinates.
The parsed data is stored in the page_data attribute under the following keys:
- “r”: List of page regions with paragraphs, lines, and tokens.
- “n”: Optional notes about parsing problems (e.g., missing coordinates).
- parse_printspace(element: Tag, mappings: dict[str, str]) → tuple[list[dict], list[str]]
Parse the <PrintSpace> element of an SUB ALTO XML document.
This function closely resembles the one inside importers.alto, but is slightly adapted to the SUB case, where there is an additional layer of “ComposedBlocks” on top of the “TextBlocks”. The original function could be used if the SubNewspaperPage.parse() method were slightly adapted (potential future work).
The <PrintSpace> element contains all the OCR information about the content items of a page, down to the lowest level of the hierarchy: the regions, paragraphs, lines and tokens, each with their corresponding coordinates.
- Parameters:
element (Tag) – Input XML element (<PrintSpace>).
mappings (dict[str, str]) – Mapping from OCR component IDs to their corresponding canonical Content Item ID.
- Returns:
List of page regions in the canonical format and notes about potential parsing problems.
- Return type:
tuple[list[dict], list[str]]
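The extra “ComposedBlock” layer described for parse_printspace() can be illustrated with a small sketch. It is not the importer's implementation: it uses the standard library's xml.etree.ElementTree rather than the Tag objects in the signature, the ALTO fragment is invented and has no namespace, the Content Item ID in the mapping is made up, and the inner keys (“c”, “p”, “l”, “t”, “tx”, “pOf”) are assumptions about the canonical format. Only “r” and “n” come from the description above.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified ALTO fragment (namespace omitted).
ALTO = """
<PrintSpace>
  <ComposedBlock ID="cb1">
    <TextBlock ID="tb1" HPOS="10" VPOS="20" WIDTH="100" HEIGHT="40">
      <TextLine HPOS="10" VPOS="20" WIDTH="100" HEIGHT="12">
        <String CONTENT="Hamburger" HPOS="10" VPOS="20" WIDTH="60" HEIGHT="12"/>
        <String CONTENT="Echo" HPOS="75" VPOS="20" WIDTH="35" HEIGHT="12"/>
      </TextLine>
    </TextBlock>
    <TextBlock ID="tb2">
      <TextLine HPOS="10" VPOS="70" WIDTH="50" HEIGHT="12">
        <String CONTENT="Abend" HPOS="10" VPOS="70" WIDTH="50" HEIGHT="12"/>
      </TextLine>
    </TextBlock>
  </ComposedBlock>
</PrintSpace>
"""


def coords(el):
    """Return [x, y, width, height], or None if any attribute is missing."""
    keys = ("HPOS", "VPOS", "WIDTH", "HEIGHT")
    if any(k not in el.attrib for k in keys):
        return None
    return [int(float(el.attrib[k])) for k in keys]


def parse_printspace_sketch(printspace, mappings):
    """Walk ComposedBlock > TextBlock > TextLine > String (illustrative only)."""
    regions, notes = [], []
    for composed in printspace.findall("ComposedBlock"):
        # The TextBlocks sit one level below the ComposedBlocks.
        for block in composed.findall("TextBlock"):
            block_id = block.get("ID")
            lines = []
            for line in block.findall("TextLine"):
                tokens = [
                    {"c": coords(s), "tx": s.get("CONTENT")}
                    for s in line.findall("String")
                ]
                lines.append({"c": coords(line), "t": tokens})
            region = {"c": coords(block), "p": [{"l": lines}]}
            if block_id in mappings:
                region["pOf"] = mappings[block_id]
            if region["c"] is None:
                notes.append(f"Missing coordinates for TextBlock {block_id}")
            regions.append(region)
    return regions, notes


regions, notes = parse_printspace_sketch(
    ET.fromstring(ALTO.strip()),
    {"tb1": "hamb_echo-1888-02-01-a-i0001"},  # made-up Content Item ID
)
page_data = {"r": regions}
if notes:
    page_data["n"] = notes
```

Here the second TextBlock lacks coordinates, so a note is recorded and stored under “n” alongside the regions under “r”.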
SUB Detect functions
This module contains helper functions to find SUB OCR data to import.
- text_preparation.importers.sub.detect.SubIssueDir
A light-weight data structure to represent a newspaper issue.
This named tuple contains basic metadata about a newspaper issue. It can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.
Note
For newspapers published multiple times per day, a lowercase letter indicates the edition: ‘a’ for the first, ‘b’ for the second, etc.
- Parameters:
provider (str) – Provider for this alias, here always “SUB”
alias (str) – Newspaper alias.
date (datetime.date) – Publication date of issue.
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s OCR data.
>>> from datetime import date
>>> from text_preparation.importers.sub.detect import SubIssueDir
>>> i = SubIssueDir(
...     provider='SUB',
...     alias='hamb_echo',
...     date=date(1888, 2, 1),
...     edition='a',
...     path='./SUB/hamb_echo/1888/02/01/Abend-Ausgabe'
... )
- text_preparation.importers.sub.detect.detect_issues(base_dir: str, alias_filter: list[str] | None = None, exclude_list: list[str] | None = None) → list[IssueDirectory]
Detect SUB issues to import within the filesystem.
Traverses the directory structure looking for METS XML files that indicate a valid issue directory. Handles multiple issues per day by assigning editions ‘a’, ‘b’, ‘c’, etc. based on alphabetical order of directory names.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data. This directory should contain directories corresponding to newspaper aliases.
alias_filter (list[str] | None, optional) – Aliases to consider. Defaults to None.
exclude_list (list[str] | None, optional) – Aliases to exclude. Defaults to None.
- Returns:
List of SubIssueDir instances to import.
- Return type:
list[SubIssueDir]
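The edition assignment described for detect_issues() (letters given in alphabetical order of directory names) can be sketched as follows. This is an illustration under stated assumptions, not the function's actual code: the helper name assign_editions and the directory name “Morgen-Ausgabe” are hypothetical.

```python
from string import ascii_lowercase


def assign_editions(dir_names: list[str]) -> dict[str, str]:
    """Map same-day issue directories to 'a', 'b', 'c', ... alphabetically."""
    # Raises IndexError for more than 26 issues on one day (not handled here).
    return {name: ascii_lowercase[i] for i, name in enumerate(sorted(dir_names))}


editions = assign_editions(["Morgen-Ausgabe", "Abend-Ausgabe"])
print(editions)
```

With this ordering, “Abend-Ausgabe” receives edition ‘a’, which agrees with the SubIssueDir example above.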
- text_preparation.importers.sub.detect.entry2issue(alias: str, year: str, month: str, entry: dict, base_dir: str) → IssueDirectory
Convert a hierarchical JSON entry into a SubIssueDir.
- entry example:
{"day": "15", "edition": "01", "local_path": "…_01"}
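A rough sketch of the kind of conversion entry2issue() performs is given below. Several points are assumptions rather than documented behaviour: that a numeric edition such as “01” maps to the letter ‘a’, that local_path is relative to base_dir, and the stand-in tuple IssueDirSketch, which only mirrors the fields listed for SubIssueDir. The elided local_path from the example is kept as-is.

```python
import os
from collections import namedtuple
from datetime import date
from string import ascii_lowercase

# Hypothetical stand-in mirroring the documented SubIssueDir fields.
IssueDirSketch = namedtuple(
    "IssueDirSketch", ["provider", "alias", "date", "edition", "path"]
)


def entry_to_issue_sketch(alias, year, month, entry, base_dir):
    """Turn one hierarchical JSON entry into an issue tuple (illustrative)."""
    # Assumption: edition "01" is the first of the day, i.e. 'a'.
    edition = ascii_lowercase[int(entry["edition"]) - 1]
    pub_date = date(int(year), int(month), int(entry["day"]))
    # Assumption: local_path is resolved relative to base_dir.
    path = os.path.join(base_dir, entry["local_path"])
    return IssueDirSketch("SUB", alias, pub_date, edition, path)


issue = entry_to_issue_sketch(
    "hamb_echo",
    "1888",
    "02",
    {"day": "15", "edition": "01", "local_path": "…_01"},
    "./SUB",
)
print(issue)
```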
- text_preparation.importers.sub.detect.select_issues(base_dir: str, config: dict) → list[IssueDirectory] | None
Selectively detect newspaper issues to import.
The behavior is very similar to detect_issues(), with the only difference that config specifies some rules to filter the data to import. See the configuration documentation for details on filtering.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data. This directory should contain directories corresponding to newspaper aliases.
config (dict) – Configuration dictionary containing ‘titles’, ‘exclude_titles’, and ‘year_only’ keys for filtering.
- Returns:
List of SubIssueDir instances to import.
- Return type:
list[SubIssueDir] | None