Adding New Datasets¶
This guide explains how to add support for a new knowledge graph dataset in the current class-based pipeline architecture.
Overview¶
The project now uses one pipeline class per dataset. Pipelines are auto-discovered via subclass registration, so you usually do not edit the CLI when adding a dataset.
Typical steps:

- Add parsing/downloading helpers in a dataset module under `text_kgc_data/datasets/`.
- Add a `DatasetPipeline` subclass in that module.
- Ensure the dataset module is imported by `text_kgc_data/datasets/__init__.py`.
- Run `text-kgc list-datasets` and confirm your dataset appears.
Required Pipeline Methods¶
In your pipeline subclass, implement these methods (the base class raises `NotImplementedError`):

- `download`
- `create_entity_id2name`
- `create_entity_id2description`
- `create_relation_id2name`
- `text_config`
Minimal example:

```python
from pathlib import Path
from typing import Dict

from ..pipeline import DatasetPipeline
from ..processors import PipelineTextConfig


class MyDatasetPipeline(DatasetPipeline):
    name = "my-dataset"
    description = "My dataset"
    aliases = ("my_dataset",)

    def download(self, output_dir: Path) -> Path:
        # Download raw files into output_dir.
        return output_dir

    def create_entity_id2name(self, data_dir: Path) -> Dict[str, str]:
        # Parse raw files and return {entity_id: entity_name}.
        return {}

    def create_entity_id2description(self, data_dir: Path) -> Dict[str, str]:
        # Parse raw files and return {entity_id: description}.
        return {}

    def create_relation_id2name(self, data_dir: Path) -> Dict[str, str]:
        # Parse raw files and return {relation_id: relation_name}.
        return {}

    def text_config(self) -> PipelineTextConfig:
        return PipelineTextConfig(
            dataset="my-dataset",
            entity_max_words=50,
            relation_max_words=30,
            fill_missing_entities=False,
            fill_missing_relations=False,
            entity_name_fill_value="",
            entity_description_fill_value="",
        )
```
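The `create_*` methods are plain parsers: read the raw files under `data_dir` and return a mapping. As an illustrative sketch only (the filename `entity2text.txt` and the tab-separated layout are assumptions, not something this project prescribes), a typical `create_entity_id2name` body might look like:

```python
import tempfile
from pathlib import Path
from typing import Dict


def parse_entity_names(data_dir: Path, filename: str = "entity2text.txt") -> Dict[str, str]:
    """Parse a tab-separated '<entity_id>\\t<entity_name>' file into a dict.

    Hypothetical helper for illustration; adapt the filename and column
    layout to your dataset's actual raw files.
    """
    mapping: Dict[str, str] = {}
    with (data_dir / filename).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue
            entity_id, _, name = line.partition("\t")
            mapping[entity_id] = name
    return mapping


# Quick self-check against a temporary raw file.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "entity2text.txt").write_text("Q1\tuniverse\nQ2\tEarth\n", encoding="utf-8")
    print(parse_entity_names(raw))  # -> {'Q1': 'universe', 'Q2': 'Earth'}
```

Inside a pipeline subclass, the same logic would simply live in the `create_entity_id2name` method and receive `data_dir` from the caller.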
Optional Pipeline Overrides¶
You can optionally override these methods when needed:

- `split_source_filename(split)`: Default expects `train.txt`, `valid.txt`, and `test.txt`; override for dataset-specific filenames.
- `default_download_subdir()`, `default_process_input()`, `default_output_subdir()`: Customize batch command paths.
- `include_in_batch_process()`: Return `False` to skip this pipeline in `process-all`.
- `postprocess_entity_id2description(...)`, `postprocess_relation_id2name(...)`: Override only for dataset-specific behavior beyond `text_config`.
Internal helper methods prefixed with `__` in `DatasetPipeline` are not extension points.
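As a hedged sketch of the most common override, a `split_source_filename` for a dataset whose raw files follow a different naming scheme could be a simple lookup. The `train2id.txt`-style filenames below are hypothetical, and the standalone function stands in for the method on your subclass:

```python
from typing import Dict

# Hypothetical filename scheme for illustration; the base class default
# expects train.txt / valid.txt / test.txt.
_SPLIT_FILES: Dict[str, str] = {
    "train": "train2id.txt",
    "valid": "valid2id.txt",
    "test": "test2id.txt",
}


def split_source_filename(split: str) -> str:
    """Map a canonical split name to this dataset's on-disk filename."""
    try:
        return _SPLIT_FILES[split]
    except KeyError:
        raise ValueError(f"Unknown split: {split!r}") from None


print(split_source_filename("valid"))  # -> valid2id.txt
```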
Discovery and Registration¶
Pipeline classes are registered automatically when the class is imported.
To ensure registration:

- Put your pipeline class in `text_kgc_data/datasets/my_dataset.py`.
- Import that module in `text_kgc_data/datasets/__init__.py`.
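Import-time registration of subclasses is typically built on Python's `__init_subclass__` hook. The toy version below illustrates the general pattern only; it is not this project's actual implementation, and the class and attribute names are placeholders:

```python
from typing import Dict, Tuple, Type


class DatasetPipelineBase:
    """Toy base class illustrating import-time subclass registration."""

    registry: Dict[str, Type["DatasetPipelineBase"]] = {}
    name: str = ""
    aliases: Tuple[str, ...] = ()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register the subclass under its name and all aliases as soon as
        # the defining module is imported -- no CLI wiring required.
        for key in (cls.name, *cls.aliases):
            if key:
                DatasetPipelineBase.registry[key] = cls


class ToyPipeline(DatasetPipelineBase):
    name = "my-dataset"
    aliases = ("my_dataset",)


print(sorted(DatasetPipelineBase.registry))  # -> ['my-dataset', 'my_dataset']
```

This is why the `__init__.py` import matters: a module that is never imported never triggers the hook, so its pipeline never appears in the registry.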
No dataset-specific CLI command wiring is needed.
CLI Usage¶
After your pipeline is registered, the generic CLI commands work automatically:

```shell
text-kgc list-datasets
text-kgc download my-dataset data/raw
text-kgc process my-dataset data/raw/my-dataset data/standardised/my-dataset
```
Standardized Output Format¶
`process_all` writes one `kg.zarr` store containing:

- Root metadata attributes: `entities`, `descriptions`, `relations`
- Triplet matrix datasets: `train_triplets`, `validation_triplets`, `test_triplets`

Each triplet matrix has shape N x 3 with the dataset attribute:

```python
columns = ["entity_id", "relation_id", "test_id"]
```
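Conceptually, each row of a triplet matrix is one (head, relation, tail) triple encoded as integer ids. The sketch below shows that encoding step with plain Python; the function name, the string-keyed id maps, and the sample ids are all illustrative assumptions, not the project's actual API:

```python
from typing import Dict, List, Tuple


def triples_to_matrix(
    triples: List[Tuple[str, str, str]],
    entity_ids: Dict[str, int],
    relation_ids: Dict[str, int],
) -> List[List[int]]:
    """Encode (head, relation, tail) string triples as N x 3 integer rows."""
    return [
        [entity_ids[h], relation_ids[r], entity_ids[t]]
        for h, r, t in triples
    ]


# Hypothetical id maps for illustration.
entity_ids = {"Q1": 0, "Q2": 1}
relation_ids = {"instance_of": 0}
matrix = triples_to_matrix([("Q2", "instance_of", "Q1")], entity_ids, relation_ids)
print(matrix)  # -> [[1, 0, 0]]
```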
Verification Checklist¶
Before opening a PR, verify:

- `text-kgc list-datasets` shows your dataset name.
- `download` and `process` run end-to-end.
- `kg.zarr` contains mapping metadata and the three triplet matrices.
- Aliases resolve correctly in the CLI (if you set `aliases`).
Reference Implementations¶
See these current examples:

- `text_kgc_data/datasets/wn18rr.py`
- `text_kgc_data/datasets/fb15k237.py`
- `text_kgc_data/datasets/wikidata5m.py`