TextKGCData: Textual Knowledge Graph Data Toolkit

This package provides tools for downloading, processing, standardizing, and loading knowledge graph data with textual descriptions. It includes a command-line interface (CLI) for all major data preparation and preprocessing steps.


Add to Your Git Project

git submodule add https://github.com/TJ-coding/TextKGCData.git packages/text-kgc-data

Installation

pip install git+https://github.com/TJ-coding/TextKGCData.git@<branch>#subdirectory=text_kgc_data_proj

Replace <branch> with the branch or tag you want to install from.

CLI Commands

All commands are available via the CLI defined in text_kgc_data/cli.py. Example usage:

python -m text_kgc_data.cli [COMMAND] [OPTIONS]
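
Assuming the CLI follows the usual Python --help convention, you can list every available command and its options with:

python -m text_kgc_data.cli --help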

Download Data

  • download_text_kgc_dataset

Download the text-based KGC dataset from the SimKGC repository.

python -m text_kgc_data.cli download-text-kgc-dataset --data-dir-name <output_dir>

Standardize WN18RR Data

  • standardize_wn18rr_entity_files_cli

Standardize WN18RR entity files (IDs, names, descriptions).

python -m text_kgc_data.cli standardize-wn18rr-entity-files-cli \
  --definitions-source-path WN18RR/wordnet-mlj12-definitions.txt \
  --entity-id-save-path wn18rr_tkg/entity_ids.txt \
  --entity-id2name-save-path wn18rr_tkg/entity_id2_name.txt \
  --entity-id2description-save-path wn18rr_tkg/entity_id2_description.txt

  • standardize_wn18rr_relation_file_cli

Standardize WN18RR relation file (relation IDs to descriptions).

python -m text_kgc_data.cli standardize-wn18rr-relation-file-cli \
  --relations-source-path WN18RR/relations.dict \
  --relation-id2name-save-path wn18rr_tkg/wn18rr-relations2description.json
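
The command writes a JSON mapping of relation IDs to descriptions, so the result can be inspected with the standard library. A minimal sketch, assuming the Dict[str, str] layout used throughout this package:

import json

# Load the standardized relation mapping produced by the command above.
with open("wn18rr_tkg/wn18rr-relations2description.json") as f:
    relation_id2description = json.load(f)  # Dict[str, str]

print(f"{len(relation_id2description)} relations standardized")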

Standardize Wikidata5M Data

  • standardize_wikidata5m_entity_files_cli

Standardize Wikidata5M entity files (IDs, names, descriptions).

python -m text_kgc_data.cli standardize-wikidata5m-entity-files-cli \
  --entity-names-source-path wikidata5m/wikidata5m_entity.txt \
  --entity-descriptions-source-path wikidata5m/wikidata5m_text.txt \
  --entity-id-save-path wikidata5m_tkg/entity_ids.txt \
  --entity-id2name-save-path wikidata5m_tkg/entity_id2_name.json \
  --entity-id2description-save-path wikidata5m_tkg/entity_id2_description.json
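
Not every Wikidata5M entity has both a name and a description. A quick check (a sketch assuming the Dict[str, str] JSON layout described under "Loading Textual KG Files" below) shows how many entries the fill-missing-entries-cli utility would need to patch:

import json

with open("wikidata5m_tkg/entity_id2_name.json") as f:
    entity_id2name = json.load(f)
with open("wikidata5m_tkg/entity_id2_description.json") as f:
    entity_id2description = json.load(f)

# IDs present in one mapping but absent from the other.
missing = set(entity_id2name) ^ set(entity_id2description)
print(f"{len(missing)} entities are missing a name or a description")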

  • standardize_wikidata5m_relation_file_cli

Standardize Wikidata5M relation file (relation IDs to names).

python -m text_kgc_data.cli standardize-wikidata5m-relation-file-cli \
  --relations-source-path wikidata5m/wikidata5m_relation.txt \
  --relation-id2name-save-path wikidata5m_tkg/relation_id2name.json

Preprocessing Utilities

  • fill_missing_entries_cli

Fill missing entries in entity name/description JSON files with a placeholder.

python -m text_kgc_data.cli fill-missing-entries-cli \
  --entity-id2name-path <input_name_json> \
  --entity-id2description-path <input_desc_json> \
  --output-entity-id2name-path <output_name_json> \
  --output-entity-id2description-path <output_desc_json> \
  --place-holder-character "-"
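
Conceptually, the command makes both mappings cover the same set of entity IDs, substituting the placeholder wherever an entry is absent. A minimal sketch of the idea (not the package's actual implementation):

place_holder = "-"
entity_id2name = {"Q1": "universe"}
entity_id2description = {"Q1": "totality of space", "Q2": "planet"}

# Every ID seen in either mapping gets an entry in both.
all_ids = set(entity_id2name) | set(entity_id2description)
filled_names = {i: entity_id2name.get(i, place_holder) for i in all_ids}
filled_descriptions = {i: entity_id2description.get(i, place_holder) for i in all_ids}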

  • truncate_description_cli

Truncate entity descriptions to a maximum number of tokens using a HuggingFace tokenizer.

python -m text_kgc_data.cli truncate-description-cli \
  --entity-id2description-path <input_desc_json> \
  --output-entity-id2description-path <output_desc_json> \
  --tokenizer-name <hf_tokenizer_name> \
  --truncate-tokens 50 \
  --batch-size 50000
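
Token-level truncation amounts to encoding each description, cutting at the token limit, and decoding back to text. A sketch with the HuggingFace transformers API (illustrative only; "bert-base-uncased" stands in for any tokenizer name, and the package's batching may differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

description = "a very long entity description that exceeds the token budget ..."
# Encode, truncate to 50 tokens, and decode back to a shortened string.
token_ids = tokenizer.encode(
    description, add_special_tokens=False, truncation=True, max_length=50
)
truncated_description = tokenizer.decode(token_ids)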

Project Layout

📁 text_kgc_data/
├── 📄 cli.py              # Command line interface for all data operations
├── 📄 download_data.py    # Downloading data from SimKGC Repo
├── 📄 helpers.py          # Helper functions for TSV/JSON handling
├── 📄 preprocessors.py    # Data cleaning: fill missing, truncate descriptions
└── 📁 standardise_tkg_files/            
    ├── 📄 standardise_wn18rr.py      # Standardize WN18RR dataset
    └── 📄 standardise_wikidata5m.py  # Standardize Wikidata5M dataset
📁 text-kgc-data-docs/
├── 📄 mkdocs.yml    # MkDocs configuration
└── 📁 docs/
    ├── 📄 index.md  # Documentation homepage
    └── 📄 ...       # Other markdown pages, images, files

Loading Textual KG Files in Python

You can load processed textual knowledge graph files into a TextualKG object using the load_tkg_from_files function:

from text_kgc_data.tkg_io import load_tkg_from_files

entity_id2name_source_path = "path/to/entity_id2name.json"  # Dict[str, str]
entity_id2description_source_path = "path/to/entity_id2description.json" # Dict[str, str]
relation_id2name_source_path = "path/to/relation_id2name.json" # Dict[str, str]

textual_kg = load_tkg_from_files(
    entity_id2name_source_path,
    entity_id2description_source_path,
    relation_id2name_source_path,
)
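
For example, the Wikidata5M files produced by the standardization commands above plug in directly (assuming they live where the earlier commands saved them):

from text_kgc_data.tkg_io import load_tkg_from_files

textual_kg = load_tkg_from_files(
    "wikidata5m_tkg/entity_id2_name.json",
    "wikidata5m_tkg/entity_id2_description.json",
    "wikidata5m_tkg/relation_id2name.json",
)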

Saving KG to Files

You can save a TextualKG object to disk using the save_tkg_to_files function. This will export the entity and relation mappings to JSON files for later use.

from text_kgc_data.tkg_io import save_tkg_to_files
from text_kgc_data.tkg import TextualKG

# Assume `textual_kg` is an instance of TextualKG
save_tkg_to_files(
  textual_kg,
  "path/to/entity_id2name.json",
  "path/to/entity_id2description.json",
  "path/to/relation_id2name.json",
)
  • Make sure the output paths are writable.
  • The saved files can be loaded later using load_tkg_from_files.
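
A quick round trip shows the two functions compose (paths illustrative):

from text_kgc_data.tkg_io import load_tkg_from_files, save_tkg_to_files

# Assume `textual_kg` is an instance of TextualKG.
save_tkg_to_files(
    textual_kg,
    "out/entity_id2name.json",
    "out/entity_id2description.json",
    "out/relation_id2name.json",
)
reloaded_kg = load_tkg_from_files(
    "out/entity_id2name.json",
    "out/entity_id2description.json",
    "out/relation_id2name.json",
)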

Notes

  • All CLI commands support custom input/output paths for flexible workflows.
  • Preprocessing utilities help ensure data consistency and compatibility with downstream models.
  • See the code in cli.py for the latest available commands and options.