Adding New Datasets¶
This guide explains how to add support for new knowledge graph datasets to the TextKGCData toolkit.
Overview¶
The TextKGCData toolkit follows a functional architecture where each dataset has its own module containing dataset-specific processing functions. To add a new dataset, you'll need to (see the file layout sketch after this list):
- Create a new dataset module
- Implement the required functions
- Add CLI commands
- Update the main exports
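Concretely, assuming the placeholder dataset name my_dataset used throughout this guide, these steps touch the following files:

text_kgc_data/
    __init__.py        # re-export the new functions (see "Updating Exports")
    cli.py             # register the new Typer sub-app (see "Adding CLI Commands")
    datasets/
        my_dataset.py  # new module with the download and mapping functions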
Creating a Dataset Module¶
Create a new file in text_kgc_data/datasets/ named after your dataset (e.g., my_dataset.py).
Required Functions¶
Each dataset module should implement these core functions:
from pathlib import Path
from typing import Dict

from beartype import beartype


@beartype
def download_my_dataset(output_dir: Path) -> Path:
    """Download dataset files.

    Args:
        output_dir: Directory where raw data will be saved

    Returns:
        Path to the downloaded data directory
    """
    # Implementation here
    pass


@beartype
def create_entity_id2name_my_dataset(raw_data_dir: Path, output_file: Path) -> Dict[str, str]:
    """Create entity ID to name mapping.

    Args:
        raw_data_dir: Directory containing raw dataset files
        output_file: Where to save the mapping JSON file

    Returns:
        Dictionary mapping entity IDs to names
    """
    # Implementation here
    pass


@beartype
def create_entity_id2description_my_dataset(raw_data_dir: Path, output_file: Path) -> Dict[str, str]:
    """Create entity ID to description mapping.

    Args:
        raw_data_dir: Directory containing raw dataset files
        output_file: Where to save the mapping JSON file

    Returns:
        Dictionary mapping entity IDs to descriptions
    """
    # Implementation here
    pass


@beartype
def create_relation_id2name_my_dataset(raw_data_dir: Path, output_file: Path) -> Dict[str, str]:
    """Create relation ID to name mapping.

    Args:
        raw_data_dir: Directory containing raw dataset files
        output_file: Where to save the mapping JSON file

    Returns:
        Dictionary mapping relation IDs to names
    """
    # Implementation here
    pass
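To make the templates more concrete, here is a minimal sketch of create_entity_id2name_my_dataset for a dataset that ships a tab-separated entity2text.txt file. The file name and format are assumptions about your raw data, not something the toolkit prescribes:

import json
from pathlib import Path
from typing import Dict

from beartype import beartype


@beartype
def create_entity_id2name_my_dataset(raw_data_dir: Path, output_file: Path) -> Dict[str, str]:
    """Create entity ID to name mapping from a tab-separated raw file."""
    entity_id2name: Dict[str, str] = {}
    # Assumed raw file: one "<entity_id>\t<entity_name>" pair per line.
    with (raw_data_dir / "entity2text.txt").open(encoding="utf-8") as raw_file:
        for line in raw_file:
            entity_id, name = line.rstrip("\n").split("\t", maxsplit=1)
            entity_id2name[entity_id] = name

    # Persist the mapping as JSON so later pipeline steps can reuse it.
    output_file.parent.mkdir(parents=True, exist_ok=True)
    output_file.write_text(
        json.dumps(entity_id2name, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return entity_id2name

The other mapping functions follow the same pattern, differing only in which raw file they read and which IDs they map.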
Function Naming Convention¶
Functions should be named to clearly describe what they create:
- create_entity_id2name_* - Creates entity ID to name mappings
- create_entity_id2description_* - Creates entity ID to description mappings
- create_relation_id2name_* - Creates relation ID to name mappings
- download_* - Downloads raw dataset files
Adding CLI Commands¶
Update text_kgc_data/cli.py to add your dataset commands:
# Add imports
from text_kgc_data.datasets.my_dataset import (
    download_my_dataset,
    create_entity_id2name_my_dataset,
    create_entity_id2description_my_dataset,
    create_relation_id2name_my_dataset,
)

# Create subcommand app
my_dataset_app = typer.Typer(help="My Dataset processing commands")


@my_dataset_app.command("download")
@beartype
def download_my_dataset_cli(
    output_dir: Path = typer.Argument(..., help="Output directory for raw data"),
) -> None:
    """Download My Dataset files."""
    result_dir = download_my_dataset(output_dir)
    typer.echo(f"Downloaded My Dataset to: {result_dir}")


# Add more commands...

# Register with main app
app.add_typer(my_dataset_app, name="my-dataset")
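The "# Add more commands..." placeholder is where the mapping commands go. As an illustration, here is a minimal sketch of a command wrapping create_entity_id2name_my_dataset; the command name matches the CLI example in "Testing Your Dataset" below, and typer, Path, and beartype are assumed to already be imported in cli.py:

@my_dataset_app.command("create-entity-mappings")
@beartype
def create_entity_id2name_my_dataset_cli(
    raw_data_dir: Path = typer.Argument(..., help="Directory containing raw dataset files"),
    output_file: Path = typer.Argument(..., help="Output path for the entity ID to name JSON file"),
) -> None:
    """Create the entity ID to name mapping for My Dataset."""
    mapping = create_entity_id2name_my_dataset(raw_data_dir, output_file)
    typer.echo(f"Wrote {len(mapping)} entity names to: {output_file}")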
Updating Exports¶
Add your functions to text_kgc_data/__init__.py:
# Add to imports
from text_kgc_data.datasets.my_dataset import (
    download_my_dataset,
    create_entity_id2name_my_dataset,
    create_entity_id2description_my_dataset,
    create_relation_id2name_my_dataset,
)

# Add to __all__
__all__ = [
    # ... existing exports ...
    "download_my_dataset",
    "create_entity_id2name_my_dataset",
    "create_entity_id2description_my_dataset",
    "create_relation_id2name_my_dataset",
]
Testing Your Dataset¶
You can test your dataset functions both programmatically and via CLI:
# Programmatic usage
from pathlib import Path

from text_kgc_data import download_my_dataset, create_entity_id2name_my_dataset

data_dir = download_my_dataset(Path("./data"))
mappings = create_entity_id2name_my_dataset(data_dir, Path("./mappings.json"))

# CLI usage
text-kgc my-dataset download ./data
text-kgc my-dataset create-entity-mappings ./data ./mappings.json
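For an automated check, a small pytest-style test can exercise the mapping function against a tiny fixture. This is only a sketch and assumes the tab-separated entity2text.txt format used in the implementation example above:

from pathlib import Path

from text_kgc_data import create_entity_id2name_my_dataset


def test_create_entity_id2name_my_dataset(tmp_path: Path) -> None:
    # Build a tiny fake raw-data directory (file name and format are assumptions, see above).
    raw_dir = tmp_path / "raw"
    raw_dir.mkdir()
    (raw_dir / "entity2text.txt").write_text("E1\tfirst entity\nE2\tsecond entity\n", encoding="utf-8")

    output_file = tmp_path / "entity_id2name.json"
    mapping = create_entity_id2name_my_dataset(raw_dir, output_file)

    assert mapping == {"E1": "first entity", "E2": "second entity"}
    assert output_file.exists()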
Example: Existing Datasets¶
Look at datasets/wn18rr.py and datasets/wikidata5m.py for complete examples of dataset implementations following this pattern.