# TextKGCData: Textual Knowledge Graph Data Toolkit
This package provides tools for downloading, processing, standardizing, and loading knowledge graph data with textual descriptions. It includes a command-line interface (CLI) for all major data preparation and preprocessing steps.
## Add to Your Git Project
```bash
git submodule add https://github.com/TJ-coding/TextKGCData.git packages/text-kgc-data
```
## Installation
```bash
pip install git+https://github.com/TJ-coding/TextKGCData.git@<branch>#subdirectory=text_kgc_data_proj
```
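If you added the repository as a Git submodule instead, you can point pip at the local checkout. The path below assumes the submodule location from the command above and that the installable project lives in the `text_kgc_data_proj` subdirectory, as the install URL suggests:

```bash
pip install ./packages/text-kgc-data/text_kgc_data_proj
```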
## CLI Commands
All commands are available via the CLI defined in `text_kgc_data/cli.py`. Example usage:

```bash
python -m text_kgc_data.cli [COMMAND] [OPTIONS]
```
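Each command also documents its own options at the command line. Assuming the CLI follows the standard `--help` convention (true for argparse-, Click-, and Typer-based CLIs), you can list commands and options with:

```bash
python -m text_kgc_data.cli --help
python -m text_kgc_data.cli download-text-kgc-dataset --help
```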
### Download Data
- `download_text_kgc_dataset`: Download the text-based KGC dataset from the SimKGC repository.

```bash
python -m text_kgc_data.cli download-text-kgc-dataset --data-dir-name <output_dir>
```
### Standardize WN18RR Data
- `standardize_wn18rr_entity_files_cli`: Standardize WN18RR entity files (IDs, names, descriptions).

```bash
python -m text_kgc_data.cli standardize-wn18rr-entity-files-cli \
  --definitions-source-path WN18RR/wordnet-mlj12-definitions.txt \
  --entity-id-save-path wn18rr_tkg/entity_ids.txt \
  --entity-id2name-save-path wn18rr_tkg/entity_id2_name.txt \
  --entity-id2description-save-path wn18rr_tkg/entity_id2_description.txt
```
- `standardize_wn18rr_relation_file_cli`: Standardize the WN18RR relation file (relation IDs to descriptions).

```bash
python -m text_kgc_data.cli standardize-wn18rr-relation-file-cli \
  --relations-source-path WN18RR/relations.dict \
  --relation-id2name-save-path wn18rr_tkg/wn18rr-relations2description.json
```
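The standardization commands write plain files you can inspect directly. For example, a quick sanity check of the relation output, assuming it is a flat JSON `Dict[str, str]` like the mappings described later on this page:

```python
import json

# Load the standardized relation mapping produced above.
with open("wn18rr_tkg/wn18rr-relations2description.json") as f:
    relation_id2description = json.load(f)

print(f"{len(relation_id2description)} relations")
# Peek at a few entries to confirm the format.
for relation_id, description in list(relation_id2description.items())[:3]:
    print(relation_id, "->", description)
```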
### Standardize Wikidata5M Data
- `standardize_wikidata5m_entity_files_cli`: Standardize Wikidata5M entity files (IDs, names, descriptions).

```bash
python -m text_kgc_data.cli standardize-wikidata5m-entity-files-cli \
  --entity-names-source-path wikidata5m/wikidata5m_entity.txt \
  --entity-descriptions-source-path wikidata5m/wikidata5m_text.txt \
  --entity-id-save-path wikidata5m_tkg/entity_ids.txt \
  --entity-id2name-save-path wikidata5m_tkg/entity_id2_name.json \
  --entity-id2description-save-path wikidata5m_tkg/entity_id2_description.json
```
- `standardize_wikidata5m_relation_file_cli`: Standardize the Wikidata5M relation file (relation IDs to names).

```bash
python -m text_kgc_data.cli standardize-wikidata5m-relation-file-cli \
  --relations-source-path wikidata5m/wikidata5m_relation.txt \
  --relation-id2name-save-path wikidata5m_tkg/relation_id2name.json
```
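After standardizing, you may want to confirm that every entity ID is covered by both mappings. A minimal check, assuming `entity_ids.txt` holds one ID per line and the JSON files are flat `Dict[str, str]` maps (the formats the file names suggest):

```python
import json

with open("wikidata5m_tkg/entity_ids.txt") as f:
    entity_ids = [line.strip() for line in f if line.strip()]

with open("wikidata5m_tkg/entity_id2_name.json") as f:
    id2name = json.load(f)
with open("wikidata5m_tkg/entity_id2_description.json") as f:
    id2description = json.load(f)

# Entities that would need placeholder filling (see the next section).
missing = [e for e in entity_ids if e not in id2name or e not in id2description]
print(f"{len(missing)} of {len(entity_ids)} entities lack a name or description")
```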
### Preprocessing Utilities
- `fill_missing_entries_cli`: Fill missing entries in entity name/description JSON files with a placeholder.

```bash
python -m text_kgc_data.cli fill-missing-entries-cli \
  --entity-id2name-path <input_name_json> \
  --entity-id2description-path <input_desc_json> \
  --output-entity-id2name-path <output_name_json> \
  --output-entity-id2description-path <output_desc_json> \
  --place-holder-character "-"
```
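Conceptually, filling amounts to giving every known ID an entry, using the placeholder wherever the source mapping has none. A sketch of that idea (the real implementation lives in `preprocessors.py`; the function name and signature here are illustrative):

```python
import json

def fill_missing(input_path: str, output_path: str,
                 entity_ids: list[str], placeholder: str = "-") -> None:
    """Write a copy of the mapping in which every entity ID has an entry."""
    with open(input_path) as f:
        mapping = json.load(f)
    # Keep existing values; substitute the placeholder for absent IDs.
    filled = {eid: mapping.get(eid, placeholder) for eid in entity_ids}
    with open(output_path, "w") as f:
        json.dump(filled, f, ensure_ascii=False, indent=2)
```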
- `truncate_description_cli`: Truncate entity descriptions to a maximum number of tokens using a HuggingFace tokenizer.

```bash
python -m text_kgc_data.cli truncate-description-cli \
  --entity-id2description-path <input_desc_json> \
  --output-entity-id2description-path <output_desc_json> \
  --tokenizer-name <hf_tokenizer_name> \
  --truncate-tokens 50 \
  --batch-size 50000
```
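Under the hood, truncating to a token budget amounts to encoding each description, keeping the first N tokens, and decoding back to text. A minimal sketch with the HuggingFace `transformers` API (an illustration, not the package's exact implementation; batching and special-token handling may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any HF tokenizer name works

def truncate_description(text: str, max_tokens: int = 50) -> str:
    # Encode without special tokens, cut to the budget, and decode back to a string.
    token_ids = tokenizer.encode(
        text, add_special_tokens=False, truncation=True, max_length=max_tokens
    )
    return tokenizer.decode(token_ids)

print(truncate_description("a very long entity description " * 20))
```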
## Project Layout
```
📁 text_kgc_data/
├── 📄 cli.py                  # Command-line interface for all data operations
├── 📄 download_data.py        # Downloads data from the SimKGC repo
├── 📄 helpers.py              # Helper functions for TSV/JSON handling
├── 📄 preprocessors.py        # Data cleaning: fill missing entries, truncate descriptions
└── 📁 standardise_tkg_files/
    ├── 📄 standardise_wn18rr.py       # Standardizes the WN18RR dataset
    └── 📄 standardise_wikidata5m.py   # Standardizes the Wikidata5M dataset
📁 text-kgc-data-docs/
├── 📄 mkdocs.yml              # MkDocs configuration
└── 📁 docs/
    ├── 📄 index.md            # Documentation homepage
    └── 📄 ...                 # Other markdown pages, images, files
```
## Loading Textual KG Files in Python
You can load processed textual knowledge graph files with the `load_tkg_from_files` function:
```python
from text_kgc_data.tkg_io import load_tkg_from_files

entity_id2name_source_path = "path/to/entity_id2name.json"                # Dict[str, str]
entity_id2description_source_path = "path/to/entity_id2description.json"  # Dict[str, str]
relation_id2name_source_path = "path/to/relation_id2name.json"            # Dict[str, str]

textual_kg = load_tkg_from_files(
    entity_id2name_source_path,
    entity_id2description_source_path,
    relation_id2name_source_path,
)
```
## Saving KG to Files
You can save a `TextualKG` object to disk using the `save_tkg_to_files` function. This exports the entity and relation mappings to JSON files for later use.
```python
from text_kgc_data.tkg_io import save_tkg_to_files
from text_kgc_data.tkg import TextualKG

# Assume `textual_kg` is an instance of TextualKG
save_tkg_to_files(
    textual_kg,
    "path/to/entity_id2name.json",
    "path/to/entity_id2description.json",
    "path/to/relation_id2name.json",
)
```
- Make sure the output paths are writable.
- The saved files can be loaded later using `load_tkg_from_files`.
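Because `save_tkg_to_files` and `load_tkg_from_files` operate on the same three file paths, a save followed by a load round-trips the mappings. A minimal sketch, assuming `textual_kg` was obtained as in the loading example above:

```python
from text_kgc_data.tkg_io import load_tkg_from_files, save_tkg_to_files

# Export the current knowledge graph ...
save_tkg_to_files(
    textual_kg,
    "out/entity_id2name.json",
    "out/entity_id2description.json",
    "out/relation_id2name.json",
)

# ... and reload it later from the same files.
reloaded_kg = load_tkg_from_files(
    "out/entity_id2name.json",
    "out/entity_id2description.json",
    "out/relation_id2name.json",
)
```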
## Notes
- All CLI commands support custom input/output paths for flexible workflows.
- Preprocessing utilities help ensure data consistency and compatibility with downstream models.
- See the code in `cli.py` for the latest available commands and options.