WN18RR Processing Guide

Command Reference

Command                                   Purpose
text-kgc wn18rr download                  Download raw WN18RR dataset
text-kgc wn18rr process                   Complete SimKGC-compatible processing
text-kgc wn18rr create-entity-text       Create entity name/description mappings
text-kgc wn18rr create-relation-text     Create relation name mappings
text-kgc wn18rr process-pipeline         Complete pipeline with options
text-kgc wn18rr fill-missing-entries     Fill missing entity entries
text-kgc wn18rr truncate-descriptions    Truncate descriptions to word limit

Batch Commands (All Datasets)

Command                              Purpose
text-kgc download-all                Download all datasets (WN18RR + FB15k-237 + Wikidata5M)
text-kgc process-all                 Process all datasets with SimKGC compatibility
text-kgc download-and-process-all    Complete pipeline for all datasets

Quick Start

Single Dataset (WN18RR only):

text-kgc wn18rr download data/raw/wn18rr
text-kgc wn18rr process data/raw/wn18rr data/standardised/wn18rr

All Datasets (Recommended):

text-kgc download-and-process-all


Step-by-Step Processing

1. Download Dataset

text-kgc wn18rr download data/raw/wn18rr

2. Create Entity Text

text-kgc wn18rr create-entity-text \
  data/raw/wn18rr/wordnet-mlj12-definitions.txt \
  data/standardised/wn18rr

3. Create Relation Text

text-kgc wn18rr create-relation-text \
  data/raw/wn18rr/relations.dict \
  data/standardised/wn18rr

4. Pipeline (Alternative)

text-kgc wn18rr process-pipeline \
  data/raw/wn18rr \
  data/standardised/wn18rr \
  --fill-missing \
  --truncate-descriptions \
  --max-words 50


Python Usage

from text_kgc_data.io import load_standardized_kg

# Load all WN18RR data at once
wn18rr_data = load_standardized_kg("data/standardised/wn18rr")

# Access the data
entities = wn18rr_data['entities']          # Entity ID -> name  
descriptions = wn18rr_data['descriptions']  # Entity ID -> description
relations = wn18rr_data['relations']        # Relation ID -> name

Or load individual files:

from text_kgc_data.io import load_json

# Load individual files manually
entity_id2name = load_json("data/standardised/wn18rr/entity_id2name.json")
entity_id2description = load_json("data/standardised/wn18rr/entity_id2description.json")
relation_id2name = load_json("data/standardised/wn18rr/relation_id2name.json")

Preprocessing Details for Academic Papers

WN18RR Dataset Specification

  • Source: derived from WN18 (WordNet) by removing inverse relations
  • Entities: 40,943 unique synsets
  • Relations: 11 semantic relations
  • Splits: 86,835 train / 3,034 validation / 3,134 test triplets
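
To sanity-check these counts against a downloaded copy, a short line-count sketch (assuming the raw files use the conventional train.txt / valid.txt / test.txt names):

from pathlib import Path

for split in ("train", "valid", "test"):
    path = Path("data/raw/wn18rr") / f"{split}.txt"
    # Each line is one triplet, so the line count is the split size
    with path.open(encoding="utf-8") as f:
        n = sum(1 for _ in f)
    print(split, n)  # expected: 86835 / 3034 / 3134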

Text Processing Methodology

Entity Name Cleaning:

  • Removes the __ prefix from WordNet synset identifiers
  • Strips POS tags and sense numbers (e.g., the _NN_1 suffix)
  • Example transformation: __dog_NN_1 → dog
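
A minimal sketch of this cleaning step; clean_entity_name is an illustrative helper, not the library's actual function, and the final underscore-to-space step for multi-word synsets is an assumption:

import re

def clean_entity_name(synset_id: str) -> str:
    # Drop the leading '__' marker from the synset identifier
    name = synset_id.removeprefix("__")
    # Strip a trailing POS tag and sense number such as '_NN_1'
    name = re.sub(r"_[A-Z]+_\d+$", "", name)
    # Assumed: remaining underscores become spaces for multi-word synsets
    return name.replace("_", " ")

assert clean_entity_name("__dog_NN_1") == "dog"
assert clean_entity_name("__hot_dog_NN_1") == "hot dog"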

Text Truncation:

  • Method: word-based truncation (not subword tokenization)
  • Implementation: text.split()[:max_words] followed by ' '.join()
  • Entity descriptions: 50 words maximum
  • Relation descriptions: 30 words maximum
  • Rationale: ensures consistent text lengths across tokenizers
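
Since the implementation is stated above, the equivalent helper is a one-liner (truncate_words is an illustrative name):

def truncate_words(text: str, max_words: int) -> str:
    # Whitespace split, keep the first max_words tokens, rejoin with spaces
    return " ".join(text.split()[:max_words])

entity_desc = truncate_words("a member of the genus Canis ...", max_words=50)
relation_desc = truncate_words("hypernym", max_words=30)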

Missing Data Handling:

  • Strategy: empty string ('') for missing descriptions
  • No artificial placeholder tokens introduced
  • Maintains data structure consistency
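
In code this amounts to a plain dictionary lookup with an empty-string default; the synset IDs below are hypothetical examples:

entity_id2description = {"02084442": "a domesticated carnivorous mammal"}

# A missing synset yields '' rather than a placeholder token
description = entity_id2description.get("00000000", "")
assert description == ""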

Text Sources:

  • Entity descriptions: wordnet-mlj12-definitions.txt
  • Entity names: derived from synset identifiers after cleaning
  • Relation names: relations.dict with underscore-to-space conversion
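
A sketch of the underscore-to-space conversion for relation names (clean_relation_name is an illustrative helper; WN18RR relation identifiers such as _hypernym carry a leading underscore):

def clean_relation_name(relation_id: str) -> str:
    # Replace underscores with spaces, then trim the leading marker
    return relation_id.replace("_", " ").strip()

assert clean_relation_name("_hypernym") == "hypernym"
assert clean_relation_name("_member_of_domain_usage") == "member of domain usage"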

Technical Specifications:

  • Character encoding: UTF-8
  • Tokenizer compatibility: BERT-base-uncased (default)
  • Output format: standardized JSON mappings + plain text entity lists
  • SimKGC compatibility: full preprocessing pipeline alignment
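
For reference, the standardized outputs are flat ID-to-string JSON mappings; the entries below sketch the expected shape with hypothetical values, not actual file contents:

import json

entity_id2name = {"02084442": "dog"}            # hypothetical entry
relation_id2name = {"_hypernym": "hypernym"}    # hypothetical entry

with open("entity_id2name.json", "w", encoding="utf-8") as f:
    json.dump(entity_id2name, f, ensure_ascii=False, indent=2)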

Citation Notes: This preprocessing follows SimKGC methodology (Wang et al., 2022). Word-based truncation ensures reproducibility across different tokenization schemes. For academic use, specify: "WN18RR entity descriptions truncated to 50 words, relation descriptions to 30 words using word-based splitting."


Paper-Ready Summary

Copy-paste for Methods section:

WN18RR Dataset Preprocessing: We process the WN18RR dataset using SimKGC-compatible preprocessing following Wang et al. (2022). The dataset contains 40,943 entities and 11 relations with 86,835/3,034/3,134 train/validation/test triplets. Entity names are derived from WordNet synset identifiers by removing the __ prefix and POS tag suffixes (e.g., __dog_NN_1 → dog). Entity descriptions are sourced from the WordNet-MLJ12 definitions provided by Yao et al. (2019) (https://arxiv.org/abs/1909.03193) and truncated to 50 words using word-based splitting. Relation names are truncated to 30 words with underscores converted to spaces. Missing descriptions are represented as empty strings to maintain consistent data structure.