OpenBioLink2020

The OpenBioLink2020 dataset is a highly challenging benchmark containing over 5 million positive and negative edges. To provide a more realistic edge prediction scenario, the test set contains no trivially predictable inverse edges from the training set and covers all edge types.
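To illustrate what a "trivially predictable inverse edge" means, here is a minimal sketch of a leakage check between a training and a test split. It assumes edges are held as (head, relation, tail) tuples and that an inverse edge is the same relation with head and tail swapped; the function name, entity IDs, and relation names are hypothetical and not taken from the dataset files.

```python
def find_inverse_leaks(train_edges, test_edges):
    """Return test edges (h, r, t) whose reverse (t, r, h) occurs in train.

    Assumption: an 'inverse' edge is the same relation with head and tail
    swapped. The benchmark's own definition may be broader.
    """
    train_set = set(train_edges)
    return [(h, r, t) for (h, r, t) in test_edges if (t, r, h) in train_set]


# Hypothetical toy data, not real OpenBioLink identifiers.
train = [
    ("GENE:1", "GENE_GENE", "GENE:2"),
    ("GENE:2", "GENE_DRUG", "DRUG:7"),
]
test = [
    ("GENE:2", "GENE_GENE", "GENE:1"),  # reverse of a training edge
    ("GENE:3", "GENE_DRUG", "DRUG:7"),  # no reverse in training
]

leaks = find_inverse_leaks(train, test)
print(leaks)
```

A split that passes this kind of check returns an empty list; on the toy data above, the first test edge is flagged because a model could predict it simply by memorizing the reversed training edge.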

OpenBioLink2020: directed, high quality is the default dataset and should be used for benchmarking. To allow analyzing the effect of data quality and of the directionality of the evaluation graph, four variants of OpenBioLink2020 are provided: directed and undirected, each with and without a quality cutoff.

Additionally, each graph is available in RDF N3 format (without train-validation-test splits).

Download

All datasets are hosted on Zenodo.

OpenBioLink2020: directed, high quality // RDF (default dataset for benchmarking)
OpenBioLink2020: undirected, high quality // RDF
OpenBioLink2020: directed, no quality cutoff // RDF
OpenBioLink2020: undirected, no quality cutoff // RDF

Leaderboard

Type	Model	MRR	Hits@1	Hits@10

Latent	RESCAL	.320	.212	.544
	TransE	.280	.175	.500
	DistMult	.300	.193	.521
	ComplEx	.319	.211	.547
	ConvE	.288	.186	.510
	RotatE	.286	.180	.511

Interpretable	AnyBURL (Maximum)	.277	.192	.457
	AnyBURL (Noisy-OR)	.159	.098	.295
	SAFRAN*	.306	.214	.501

If you want to see your results added to the leaderboard, please create a new issue.
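The leaderboard columns are rank-based metrics: MRR is the mean reciprocal rank of the true entity, and h@k is the fraction of test queries for which the true entity is ranked in the top k. The sketch below uses these standard definitions on a hypothetical list of ranks; it does not reproduce the benchmark's own evaluation protocol (for example, how candidates are filtered or how ties are broken), which is not described here.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose true entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the true entity for four test queries.
ranks = [1, 2, 5, 20]

print(f"MRR  = {mean_reciprocal_rank(ranks):.4f}")  # (1 + 1/2 + 1/5 + 1/20) / 4
print(f"h@1  = {hits_at_k(ranks, 1):.2f}")
print(f"h@10 = {hits_at_k(ranks, 10):.2f}")
```

For these four ranks, MRR is 0.4375, h@1 is 0.25, and h@10 is 0.75. Higher is better for all three metrics.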

Summary

Dataset	Train	Test	Valid	Entities	Relations
directed, high quality	8,503,580	401,901	397,066	184,732	28
undirected, high quality	7,559,921	372,877	357,297	184,722	28
directed, no quality cutoff	51,636,927	2,079,139	2,474,921	486,998	32
undirected, no quality cutoff	41,383,093	2,010,662	1,932,436	486,998	32