# FB15k-237 Processing Guide

## Command Reference
| Command | Purpose |
|---|---|
| `text-kgc fb15k237 download` | Download raw FB15k-237 dataset |
| `text-kgc fb15k237 process` | Complete SimKGC-compatible processing |
| `text-kgc fb15k237 create-entity-text` | Create entity name/description mappings |
| `text-kgc fb15k237 create-relation-text` | Create relation name mappings |
| `text-kgc fb15k237 process-pipeline` | Complete pipeline with options |
| `text-kgc fb15k237 fill-missing-entries` | Fill missing entity entries |
| `text-kgc fb15k237 truncate-descriptions` | Truncate descriptions to word limit |
## Quick Start

**Single Dataset (FB15k-237 only):**

```bash
text-kgc fb15k237 download data/raw/fb15k237
text-kgc fb15k237 process data/raw/fb15k237 data/standardised/fb15k237
```

**All Datasets (Recommended):**

```bash
text-kgc download-and-process-all
```
## Step-by-Step Processing

**1. Download Dataset**

```bash
text-kgc fb15k237 download data/raw/fb15k237
```

**2. Create Entity Text**

```bash
text-kgc fb15k237 create-entity-text \
    data/raw/fb15k237/FB15k_mid2description.txt \
    data/standardised/fb15k237
```

**3. Create Relation Text**

```bash
text-kgc fb15k237 create-relation-text \
    data/raw/fb15k237/relations.dict \
    data/standardised/fb15k237
```

**4. Pipeline (Alternative)**

```bash
text-kgc fb15k237 process-pipeline \
    data/raw/fb15k237 \
    data/standardised/fb15k237 \
    --fill-missing \
    --truncate-descriptions \
    --max-words 50
```
## Python Usage

```python
from text_kgc_data.tkg_io import load_tkg_from_files

# Load the standardised textual KG from the three JSON mappings.
textual_fb15k237_kg = load_tkg_from_files(
    "data/standardised/fb15k237/entity_id2name.json",
    "data/standardised/fb15k237/entity_id2description.json",
    "data/standardised/fb15k237/relation_id2name.json",
)
```
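Since the standardised outputs are plain JSON, the mappings can also be inspected directly. A minimal sketch; the example MID lookup is illustrative:

```python
import json

# Keys of the standardised mapping are Freebase MIDs.
with open("data/standardised/fb15k237/entity_id2name.json", encoding="utf-8") as f:
    entity_id2name = json.load(f)

print(entity_id2name.get("/m/02mjmr"))  # illustrative lookup of a single MID
```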
## Preprocessing Details for Academic Papers

### FB15k-237 Dataset Specification
- Source: Freebase Knowledge Graph (filtered subset)
- Entities: 14,541 unique entities
- Relations: 237 semantic relations
- Splits: 272,115 train / 17,535 validation / 20,466 test triplets
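These counts can be sanity-checked after download. A minimal sketch, assuming the standard FB15k-237 layout of tab-separated `train.txt`/`valid.txt`/`test.txt` files (an assumption about the downloaded directory, not a documented guarantee of this tool):

```python
# Tally triples per split; file names assume the standard
# FB15k-237 distribution layout.
for split in ("train", "valid", "test"):
    with open(f"data/raw/fb15k237/{split}.txt", encoding="utf-8") as f:
        print(split, sum(1 for _ in f))
```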
### Text Processing Methodology

**Entity Name Cleaning:**

- Removes namespace prefixes from Freebase entity identifiers
- Converts underscores to spaces for readability
- Example transformation: `/m/02mjmr` → entity name from the mid2name mapping (see the sketch below)
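The cleaning itself is simple string manipulation. A minimal sketch; the function name and the raw-name example are illustrative, not the package's API:

```python
def clean_entity_name(raw_name: str) -> str:
    # Drop any leading namespace prefix, then replace
    # underscores with spaces for readability.
    return raw_name.rsplit('/', 1)[-1].replace('_', ' ')

# Illustrative: clean_entity_name('Barack_Obama') -> 'Barack Obama'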
**Relation Name Processing:**

- Removes namespace prefixes (e.g., `/base/`, `/people/`)
- Converts forward slashes to spaces
- Deduplicates consecutive identical tokens
- Reverses the token order so the most specific token comes first
- Example transformation: `/people/person/nationality` → `nationality person people` (see the sketch below)
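A minimal sketch of this cleaning, written to reproduce the example above; the function name is illustrative, not the package's API:

```python
def clean_relation_name(relation: str) -> str:
    # Split on '/' (dropping the empty leading token), replace
    # underscores, deduplicate consecutive repeats, and reverse
    # so the most specific token leads.
    tokens = [t.replace('_', ' ') for t in relation.strip('/').split('/')]
    deduped = [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]
    return ' '.join(reversed(deduped))

# clean_relation_name('/people/person/nationality') -> 'nationality person people'
```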
**Text Truncation:**

- Method: word-based truncation (not subword tokenization)
- Implementation: `text.split()[:max_words]` followed by `' '.join()`
- Entity descriptions: 50 words maximum
- Relation descriptions: 10 words maximum (FB15k-237 specific)
- Rationale: ensures consistent text lengths across tokenizers
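Equivalently, as a one-line helper (the function name is illustrative):

```python
def truncate_words(text: str, max_words: int = 50) -> str:
    # Word-based truncation: keep the first max_words
    # whitespace-delimited tokens and rejoin with single spaces.
    return ' '.join(text.split()[:max_words])
```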
**Missing Data Handling:**

- Strategy: empty string (`''`) for missing descriptions
- No artificial placeholder tokens introduced
- Maintains data structure consistency
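A minimal sketch of this strategy; the function name is illustrative, not the package's API:

```python
def fill_missing_entries(entity_ids, id2description):
    # Every entity gets an entry; absent descriptions become ''
    # rather than a placeholder token.
    return {eid: id2description.get(eid, '') for eid in entity_ids}
```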
**Text Sources:**

- Entity descriptions: `FB15k_mid2description.txt`
- Entity names: `FB15k_mid2name.txt`
- Relation names: derived from relation identifiers with cleaning
**Technical Specifications:**

- Character encoding: UTF-8
- Tokenizer compatibility: BERT-base-uncased (default)
- Output format: standardized JSON mappings + plain-text entity lists
- SimKGC compatibility: full preprocessing pipeline alignment
**Citation Notes:** This preprocessing follows SimKGC methodology (Wang et al., 2022). Word-based truncation ensures reproducibility across different tokenization schemes. For academic use, specify: "FB15k-237 entity descriptions truncated to 50 words, relation descriptions to 10 words using word-based splitting."
## Paper-Ready Summary

**Copy-paste for Methods section:**

FB15k-237 Dataset Preprocessing: We process the FB15k-237 dataset using SimKGC-compatible preprocessing following Wang et al. (2022). The dataset contains 14,541 entities and 237 relations with 272,115/17,535/20,466 train/validation/test triplets. Entity names and descriptions are sourced from Freebase mid-to-name and mid-to-description mappings. Entity descriptions are truncated to 50 words and relation names to 10 words using word-based splitting. Relation names undergo namespace cleaning by removing prefixes like `/people/` and converting forward slashes to spaces. Missing descriptions are represented as empty strings to maintain consistent data structure.