OpenBioLink2020
The OpenBioLink2020 dataset is a highly challenging benchmark dataset containing over 5 million positive and negative edges. To provide a more realistic edge prediction scenario, the test set excludes trivially predictable inverse edges from the training set and contains all edge types.
OpenBioLink2020: directed, high quality is the default dataset that should be used for benchmarking purposes. To allow analyzing the effect of data quality and of the directionality of the evaluation graph, four variants of OpenBioLink2020 are provided: directed and undirected, each with and without a quality cutoff.
Additionally, each graph is available in RDF N3 format (without train-validation-test splits).
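The inverse-edge property mentioned above can be illustrated with a small check. This is a minimal sketch, not the procedure used to build OpenBioLink2020: it assumes edges are available as (head, relation, tail) triples and treats a test edge as an inverse leak if the reversed node pair occurs in the training set under any relation. The function name and all identifiers in the toy data are hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def inverse_leaks(train: Iterable[Triple], test: Iterable[Triple]) -> Set[Triple]:
    """Return test triples whose reversed node pair (tail, head) occurs in train.

    Assumption: an "inverse edge" is any training edge connecting the same two
    nodes in the opposite direction, regardless of relation type.
    """
    train_pairs = {(h, t) for h, _, t in train}
    return {(h, r, t) for h, r, t in test if (t, h) in train_pairs}


# Toy data with made-up node and relation names (not taken from the dataset).
train = [
    ("GENE_A", "GENE_DRUG", "DRUG_X"),
    ("DRUG_Y", "DRUG_PHENOTYPE", "PHENO_1"),
]
test = [
    ("DRUG_X", "DRUG_GENE", "GENE_A"),  # reverse of the first training edge
    ("GENE_B", "GENE_DRUG", "DRUG_Y"),  # no reversed counterpart in train
]

leaks = inverse_leaks(train, test)
print(leaks)
```

A split without this kind of leakage would return an empty set for the real training and test triples.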
Download
All datasets are hosted on Zenodo.
OpenBioLink2020: directed, high quality // RDF (default dataset for benchmarking)
Leaderboard
| | Model | MRR | h@1 | |
|---|---|---|---|---|
| Latent | RESCAL | .320 | .212 | .544 |
| | TransE | .280 | .175 | .500 |
| | DistMult | .300 | .193 | .521 |
| | ComplEx | .319 | .211 | .547 |
| | ConvE | .288 | .186 | .510 |
| | RotatE | .286 | .180 | .511 |
| Interpretable | AnyBURL (Maximum) | .277 | .192 | .457 |
| | AnyBURL (Noisy-OR) | .159 | .098 | .295 |
| | SAFRAN* | .306 | .214 | .501 |
If you want to see your results added to the Leaderboard, please create a new issue.
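For readers comparing their own results against the leaderboard, the sketch below shows the standard definitions of mean reciprocal rank (MRR) and hits@k, reading h@1 as hits@1. It assumes each test edge has already been assigned a 1-based rank for its correct entity. The evaluation protocol used for the leaderboard (for example filtering of known edges or tie handling) is not specified here, so the function names and the toy ranks are illustrative only.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all test edges (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of test edges whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Made-up ranks for five test edges, not results from any model.
ranks = [1, 2, 4, 10, 50]

mrr = mean_reciprocal_rank(ranks)
h1 = hits_at_k(ranks, 1)
h10 = hits_at_k(ranks, 10)
print(f"MRR={mrr:.3f} h@1={h1:.3f} h@10={h10:.3f}")
```

Both metrics lie between 0 and 1, and higher is better.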
Summary
| Dataset | Train | Test | Valid | Entities | Relations |
|---|---|---|---|---|---|
| directed, high quality | 8.503.580 | 401.901 | 397.066 | 184.732 | 28 |
| undirected, high quality | 7.559.921 | 372.877 | 357.297 | 184.722 | 28 |
| directed, no quality cutoff | 51.636.927 | 2.079.139 | 2.474.921 | 486.998 | 32 |
| undirected, no quality cutoff | 41.383.093 | 2.010.662 | 1.932.436 | 486.998 | 32 |
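The relative size of each split follows directly from the counts in the summary table. The short sketch below computes the total number of edges and the train, test, and validation shares per variant. The counts are copied from the table, and the helper name `split_shares` is made up for this example.

```python
from typing import Dict, Tuple

# (train, test, valid) edge counts copied from the summary table.
SPLITS: Dict[str, Tuple[int, int, int]] = {
    "directed, high quality": (8_503_580, 401_901, 397_066),
    "undirected, high quality": (7_559_921, 372_877, 357_297),
    "directed, no quality cutoff": (51_636_927, 2_079_139, 2_474_921),
    "undirected, no quality cutoff": (41_383_093, 2_010_662, 1_932_436),
}


def split_shares(train: int, test: int, valid: int) -> Tuple[int, float, float, float]:
    """Return the total edge count and the train/test/valid fractions."""
    total = train + test + valid
    return total, train / total, test / total, valid / total


for name, counts in SPLITS.items():
    total, p_train, p_test, p_valid = split_shares(*counts)
    print(f"{name}: total={total:,} train={p_train:.1%} test={p_test:.1%} valid={p_valid:.1%}")
```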