Adding New Datasets

This guide explains how to add support for a new knowledge graph dataset in the current class-based pipeline architecture.

Overview

The project now uses one pipeline class per dataset. Pipelines are auto-discovered via subclass registration, so you usually do not edit the CLI when adding a dataset.

Typical steps:

  1. Add parsing/downloading helpers in a dataset module under text_kgc_data/datasets/.
  2. Add a DatasetPipeline subclass in that module.
  3. Ensure the dataset module is imported by text_kgc_data/datasets/__init__.py.
  4. Run text-kgc list-datasets and confirm your dataset appears.

Required Pipeline Methods

In your pipeline subclass, implement these methods (the base class raises NotImplementedError):

  • download
  • create_entity_id2name
  • create_entity_id2description
  • create_relation_id2name
  • text_config

Minimal example:

from pathlib import Path
from typing import Dict

from ..pipeline import DatasetPipeline
from ..processors import PipelineTextConfig


class MyDatasetPipeline(DatasetPipeline):
    name = "my-dataset"
    description = "My dataset"
    aliases = ("my_dataset",)

    def download(self, output_dir: Path) -> Path:
        # Download raw files into output_dir.
        return output_dir

    def create_entity_id2name(self, data_dir: Path) -> Dict[str, str]:
        # Parse raw files and return {entity_id: entity_name}
        return {}

    def create_entity_id2description(self, data_dir: Path) -> Dict[str, str]:
        # Parse raw files and return {entity_id: description}
        return {}

    def create_relation_id2name(self, data_dir: Path) -> Dict[str, str]:
        # Parse raw files and return {relation_id: relation_name}
        return {}

    def text_config(self) -> PipelineTextConfig:
        return PipelineTextConfig(
            dataset="my-dataset",
            entity_max_words=50,
            relation_max_words=30,
            fill_missing_entities=False,
            fill_missing_relations=False,
            entity_name_fill_value="",
            entity_description_fill_value="",
        )

Optional Pipeline Overrides

You can optionally override these methods when needed:

  • split_source_filename(split): the default expects train.txt, valid.txt, and test.txt; override for dataset-specific filenames.
  • default_download_subdir(), default_process_input(), default_output_subdir(): customize the paths used by the batch commands.
  • include_in_batch_process(): return False to skip this pipeline in process-all.
  • postprocess_entity_id2description(...), postprocess_relation_id2name(...): override only for dataset-specific behavior beyond text_config.

Internal helper methods prefixed with __ in DatasetPipeline are not extension points.
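As one illustration, a dataset whose raw splits use non-default filenames could override split_source_filename. The sketch below shows the mapping logic as a standalone function so it runs on its own; in a real pipeline it would be a method on your subclass, and the filenames are hypothetical:

```python
# Hypothetical split -> filename mapping; use your dataset's real names.
_SPLIT_FILENAMES = {
    "train": "train2id.txt",
    "valid": "dev.txt",
    "test": "test.txt",
}


def split_source_filename(split: str) -> str:
    # In a pipeline subclass this would be:
    #   def split_source_filename(self, split: str) -> str: ...
    try:
        return _SPLIT_FILENAMES[split]
    except KeyError:
        raise ValueError(f"unknown split: {split!r}")
```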

Discovery and Registration

Pipeline classes register themselves automatically (via subclass registration) when the module that defines them is imported.

To ensure registration:

  1. Put your pipeline class in text_kgc_data/datasets/my_dataset.py.
  2. Import that module in text_kgc_data/datasets/__init__.py.

No dataset-specific CLI command wiring is needed.
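Following the module name used in the minimal example above, the __init__.py addition might look like this (a fragment, not a runnable script):

```python
# text_kgc_data/datasets/__init__.py
from . import my_dataset  # noqa: F401  (the import alone triggers registration)
```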

CLI Usage

After your pipeline is registered, generic CLI commands work automatically:

text-kgc list-datasets
text-kgc download my-dataset data/raw
text-kgc process my-dataset data/raw/my-dataset data/standardised/my-dataset

Standardized Output Format

The process and process-all commands write one kg.zarr store per dataset, containing:

  • Root metadata attributes: entities, descriptions, relations
  • Triplet matrix datasets: train_triplets, validation_triplets, test_triplets

Each triplet matrix has shape N x 3 and carries a columns dataset attribute:

  • columns = ["entity_id", "relation_id", "tail_id"]

Verification Checklist

Before opening a PR, verify:

  1. text-kgc list-datasets shows your dataset name.
  2. download and process run end-to-end.
  3. kg.zarr contains mapping metadata and the three triplet matrices.
  4. Aliases resolve correctly in CLI (if you set aliases).

Reference Implementations

See these current examples:

  • text_kgc_data/datasets/wn18rr.py
  • text_kgc_data/datasets/fb15k237.py
  • text_kgc_data/datasets/wikidata5m.py