# WN18RR Processing Guide

## Command Reference
| Command | Purpose |
|---|---|
| `text-kgc wn18rr download` | Download the raw WN18RR dataset |
| `text-kgc wn18rr process` | Complete SimKGC-compatible processing |
| `text-kgc wn18rr create-entity-text` | Create entity name/description mappings |
| `text-kgc wn18rr create-relation-text` | Create relation name mappings |
| `text-kgc wn18rr process-pipeline` | Complete pipeline with options |
| `text-kgc wn18rr fill-missing-entries` | Fill missing entity entries |
| `text-kgc wn18rr truncate-descriptions` | Truncate descriptions to a word limit |
## Batch Commands (All Datasets)

| Command | Purpose |
|---|---|
| `text-kgc download-all` | Download all datasets (WN18RR + FB15k-237 + Wikidata5M) |
| `text-kgc process-all` | Process all datasets with SimKGC compatibility |
| `text-kgc download-and-process-all` | Complete pipeline for all datasets |
## Quick Start

**Single Dataset (WN18RR only):**

```bash
text-kgc wn18rr download data/raw/wn18rr
text-kgc wn18rr process data/raw/wn18rr data/standardised/wn18rr
```

**All Datasets (Recommended):**

```bash
text-kgc download-and-process-all
```
## Step-by-Step Processing

**1. Download the dataset:**

```bash
text-kgc wn18rr download data/raw/wn18rr
```

**2. Create entity text:**

```bash
text-kgc wn18rr create-entity-text \
    data/raw/wn18rr/wordnet-mlj12-definitions.txt \
    data/standardised/wn18rr
```

**3. Create relation text:**

```bash
text-kgc wn18rr create-relation-text \
    data/raw/wn18rr/relations.dict \
    data/standardised/wn18rr
```

**4. Run the full pipeline instead (alternative to steps 2–3):**

```bash
text-kgc wn18rr process-pipeline \
    data/raw/wn18rr \
    data/standardised/wn18rr \
    --fill-missing \
    --truncate-descriptions \
    --max-words 50
```
## Python Usage

```python
from text_kgc_data.io import load_standardized_kg

# Load all WN18RR data at once
wn18rr_data = load_standardized_kg("data/standardised/wn18rr")

# Access the data
entities = wn18rr_data['entities']          # Entity ID -> name
descriptions = wn18rr_data['descriptions']  # Entity ID -> description
relations = wn18rr_data['relations']        # Relation ID -> name
```
Or load individual files:
```python
from text_kgc_data.io import load_json

# Load individual files manually
entity_id2name = load_json("data/standardised/wn18rr/entity_id2name.json")
entity_id2description = load_json("data/standardised/wn18rr/entity_id2description.json")
relation_id2name = load_json("data/standardised/wn18rr/relation_id2name.json")
```
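As a quick sanity check that the mappings line up, one illustrative pattern (hypothetical, not part of the package API) is to merge names and descriptions into a single text string per entity:

```python
# Illustrative only: combine name and description into one string per entity.
# The exact text format a downstream model expects may differ.
entity_texts = {
    entity_id: f"{name}: {entity_id2description.get(entity_id, '')}".rstrip(": ")
    for entity_id, name in entity_id2name.items()
}
print(next(iter(entity_texts.items())))
```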
## Preprocessing Details for Academic Papers

### WN18RR Dataset Specification

- **Source:** WN18 (WordNet) with inverse relations removed
- **Entities:** 40,943 unique synsets
- **Relations:** 11 semantic relations
- **Splits:** 86,835 train / 3,034 validation / 3,134 test triplets
### Text Processing Methodology

**Entity Name Cleaning:**

- Removes the `__` prefix from WordNet synset identifiers
- Strips POS tags and sense numbers (e.g., the `_NN_1` suffix)
- Example transformation: `__dog_NN_1` → `dog`
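A minimal sketch of these cleaning rules, assuming a hypothetical helper name (the package's internal function may differ):

```python
import re

def clean_synset_name(raw: str) -> str:
    """Hypothetical sketch: '__dog_NN_1' -> 'dog'."""
    name = raw.lstrip("_")                    # drop the leading __ prefix
    name = re.sub(r"_[A-Z]+_\d+$", "", name)  # drop POS tag + sense number
    return name.replace("_", " ")             # multi-word synsets use underscores

assert clean_synset_name("__dog_NN_1") == "dog"
```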
**Text Truncation:**

- Method: word-based truncation (not subword tokenization)
- Implementation: `text.split()[:max_words]` followed by `' '.join()`
- Entity descriptions: 50 words maximum
- Relation descriptions: 30 words maximum
- Rationale: ensures consistent text lengths across tokenizers
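Concretely, the truncation reduces to a one-liner; this sketch mirrors the documented `split`/`join` implementation:

```python
def truncate_words(text: str, max_words: int = 50) -> str:
    """Keep at most `max_words` whitespace-separated words."""
    return " ".join(text.split()[:max_words])
```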
**Missing Data Handling:**

- Strategy: empty string (`''`) for missing descriptions
- No artificial placeholder tokens introduced
- Maintains data structure consistency
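A minimal sketch of the empty-string strategy, with hypothetical names (the internals of `fill-missing-entries` may differ):

```python
def fill_missing(entity_ids, id2description):
    """Ensure every entity ID has a description entry; missing ones get ''."""
    return {eid: id2description.get(eid, "") for eid in entity_ids}
```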
**Text Sources:**

- Entity descriptions: `wordnet-mlj12-definitions.txt`
- Entity names: derived from synset identifiers after cleaning
- Relation names: `relations.dict` with underscore-to-space conversion
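For illustration, the underscore-to-space conversion amounts to the following (hypothetical helper; WN18RR relation identifiers such as `_hypernym` carry a leading underscore):

```python
def relation_name(raw_id: str) -> str:
    """Hypothetical sketch: '_member_of_domain_usage' -> 'member of domain usage'."""
    return raw_id.strip("_").replace("_", " ")
```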
**Technical Specifications:**

- Character encoding: UTF-8
- Tokenizer compatibility: BERT-base-uncased (default)
- Output format: standardized JSON mappings + plain-text entity lists
- SimKGC compatibility: full preprocessing pipeline alignment
**Citation Notes:** This preprocessing follows SimKGC methodology (Wang et al., 2022). Word-based truncation ensures reproducibility across different tokenization schemes. For academic use, specify: "WN18RR entity descriptions truncated to 50 words, relation descriptions to 30 words using word-based splitting."
## Paper-Ready Summary

Copy-paste for a Methods section:

> **WN18RR Dataset Preprocessing:** We process the WN18RR dataset using SimKGC-compatible preprocessing following Wang et al. (2022). The dataset contains 40,943 entities and 11 relations with 86,835/3,034/3,134 train/validation/test triplets. Entity names are derived from WordNet synset identifiers by removing the `__` prefix and POS tag suffixes (e.g., `__dog_NN_1` → `dog`). Entity descriptions are sourced from the WordNet-MLJ12 definitions provided by Yao et al. (2019) (https://arxiv.org/abs/1909.03193) and truncated to 50 words using word-based splitting. Relation names are truncated to 30 words with underscores converted to spaces. Missing descriptions are represented as empty strings to maintain consistent data structure.