Dataset split

Commands

To split the default graph using the random scheme, use:

openbiolink split rand --edges graph_files/edges.csv --tn-edges graph_files/TN_edges.csv --nodes graph_files/nodes.csv

For a list of arguments, use:

openbiolink split rand --help

Splitting can also be done by time with

openbiolink split time

File description

Default file name	Description	Column descriptions
train_sample.csv	All positive samples from the training set	Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
test_sample.csv	All positive samples from the test set	Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
val_sample.csv	All positive samples from the validation set	Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
negative_train_sample.csv	All negative samples from the training set	Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
negative_test_sample.csv	All negative samples from the test set	Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
negative_val_sample.csv	All negative samples from the validation set	Node 1 ID, Edge type, Node 2 ID, Quality score, TP/TN, Source
train_val_nodes.csv	All nodes present in the training and validation set combined	Node ID, Node type
test_nodes.csv	All nodes present in the test set	Node ID, Node typ
removed_test_nodes.csv	All nodes which got removed from the test set, due to not being present in the trainingset	Node ID
removed_val_nodes.csv	All nodes which got removed from the validation set, due to not being present in the trainingset	Node ID

Random split

In the random split setting, first, negative sampling is performed. Afterwards, the whole dataset (containing positive and negative examples) is split randomly according to the defined ratio. Finally, post-processing steps are performed to facilitate training and to avoid information leakage.

Time-slice split

In the time-slice split setting, for both of the provided time slices, first, negative sampling is performed. Afterwards, the first time slice (t-1 graph) is used as training sample, while the difference between the first and the second time slice serves as the test set. Finally, post-processing steps are performed to facilitate training and to avoid information leakage.

Generally, the time slice setting is trickier to implement than the random split strategy, as it requires more manual evaluation and knowledge of the data. One of the most difficult factors is the change of the source databases over time. For example, a database might change its quality score, or even its ID-format. Also, the number of relationships stored might increase sharply due to new mapping files being used. This might also result in ‘vanishing edges’, where edges that were present in the t-1 graph are no longer existent in the current graph. Although the OpenBioLink toolbox tries to assist the user with different kinds of warnings to identify such difficulties in the data, it is unfortunately not possible to automatically detect nor solve all these problems, making some manual pre- and post-processing of the data inevitable.

Post-processing

To facilitate model application

Edges that contain nodes that are not present in the training set are dropped from the test set. This facilitates use of embedding-based models that usually cannot make predictions for nodes that have not been embedded during training.

Avoiding train-test information leakage and trivial predictions in the test set

Removal of reverse edges If the graph is directed, reverse edges are removed from the training set. The reason for this is that if the original edge a-b was undirected, both directions a→b and a←b are materialized in the directed graph. If one of these directed edges would be present in the training set and one in the test set, the prediction would be trivial. Therefore, in these cases, the reverse edges from the training set are removed. (Note that edges are removed from the training set instead of the test set because this is advantagous for maintaining the train-test-set ratio)
Removal of super-properties Some types of edges have sub-property characteristics, meaning that relationship x indicates a generic interaction between two entities (e.g. _protein_interaction_protein_), while relationship y further describes this relationship in more detail (e.g., _protein_activation_protein_). This means that the presence of x between two nodes does not imply the existence of a relation y between those same entities, but the presence of y necessarily implies the existence of x. These kinds of relationships could cause information leakage in the datasets, therefore super-relations of relations present in the training set are removed from the test set.