Adding New Datasets

This guide explains how to add support for a new knowledge graph dataset in the current class-based pipeline architecture.

Overview

The project now uses one pipeline class per dataset. Pipelines are auto-discovered via subclass registration, so you usually do not edit the CLI when adding a dataset.

Typical steps:

  1. Add parsing/downloading helpers in a dataset module under text_kgc_data/datasets/.
  2. Add a DatasetPipeline subclass in that module.
  3. Ensure the dataset module is imported by text_kgc_data/datasets/__init__.py.
  4. Run text-kgc list-datasets and confirm your dataset appears.

Required Pipeline Methods

In your pipeline subclass, implement these methods (the base class raises NotImplementedError):

  • download
  • create_entity_id2name
  • create_entity_id2description
  • create_relation_id2name
  • text_config

Minimal example:

from pathlib import Path
from typing import Dict

from ..pipeline import DatasetPipeline
from ..processors import PipelineTextConfig


class MyDatasetPipeline(DatasetPipeline):
    name = "my-dataset"
    description = "My dataset"
    aliases = ("my_dataset",)

    def download(self, output_dir: Path) -> Path:
        # Download raw files into output_dir.
        return output_dir

    def create_entity_id2name(self, data_dir: Path) -> Dict[str, str]:
        # Parse raw files and return {entity_id: entity_name}
        return {}

    def create_entity_id2description(self, data_dir: Path) -> Dict[str, str]:
        # Parse raw files and return {entity_id: description}
        return {}

    def create_relation_id2name(self, data_dir: Path) -> Dict[str, str]:
        # Parse raw files and return {relation_id: relation_name}
        return {}

    def text_config(self) -> PipelineTextConfig:
        return PipelineTextConfig(
            dataset="my-dataset",
            entity_max_words=50,
            relation_max_words=30,
            fill_missing_entities=False,
            fill_missing_relations=False,
            entity_name_fill_value="",
            entity_description_fill_value="",
        )

Optional Pipeline Overrides

You can optionally override these methods when needed:

  • split_source_filename(split): the default expects train.txt, valid.txt, and test.txt; override for dataset-specific filenames.
  • default_download_subdir(), default_process_input(), default_output_subdir(): customize the paths used by the batch commands.
  • include_in_batch_process(): return False to skip this pipeline in process-all.
  • postprocess_entity_id2description(...), postprocess_relation_id2name(...): override only for dataset-specific behavior beyond text_config.

Internal helper methods prefixed with __ in DatasetPipeline are not extension points.
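As one illustration, a dataset whose raw splits use non-default filenames could override split_source_filename. The sketch below shows the mapping logic as a standalone function so it runs on its own; in a real pipeline it would be a method on your subclass, and the filenames are hypothetical:

```python
# Hypothetical split -> filename mapping; use your dataset's real names.
_SPLIT_FILENAMES = {
    "train": "train2id.txt",
    "valid": "dev.txt",
    "test": "test.txt",
}


def split_source_filename(split: str) -> str:
    # In a pipeline subclass this would be:
    #   def split_source_filename(self, split: str) -> str: ...
    try:
        return _SPLIT_FILENAMES[split]
    except KeyError:
        raise ValueError(f"unknown split: {split!r}")
```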

Discovery and Registration

Pipeline classes register themselves automatically (via subclass registration) when the module that defines them is imported.

To ensure registration:

  1. Put your pipeline class in text_kgc_data/datasets/my_dataset.py.
  2. Import that module in text_kgc_data/datasets/__init__.py.

No dataset-specific CLI command wiring is needed.
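Following the module name used in the minimal example above, the __init__.py addition might look like this (a fragment, not a runnable script):

```python
# text_kgc_data/datasets/__init__.py
from . import my_dataset  # noqa: F401  (the import alone triggers registration)
```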

CLI Usage

After your pipeline is registered, generic CLI commands work automatically:

text-kgc list-datasets
text-kgc download my-dataset data/raw
text-kgc process my-dataset data/raw/my-dataset data/standardised/my-dataset

Standardized Output Format

The process and process-all commands write one kg.zarr store per dataset, containing:

  • Root metadata attributes: entities, descriptions, relations
  • Triplet matrix datasets: train_triplets, validation_triplets, test_triplets

Each triplet matrix has shape N x 3 and carries a columns dataset attribute:

  • columns = ["entity_id", "relation_id", "tail_id"]

Verification Checklist

Before opening a PR, verify:

  1. text-kgc list-datasets shows your dataset name.
  2. download and process run end-to-end.
  3. kg.zarr contains mapping metadata and the three triplet matrices.
  4. Aliases resolve correctly in CLI (if you set aliases).

Reference Implementations

See these current examples:

  • text_kgc_data/datasets/wn18rr.py
  • text_kgc_data/datasets/fb15k237.py
  • text_kgc_data/datasets/wikidata5m.py