Linked Data Mining Challenge

Know@LOD 2014 will host the second edition of the Linked Data Mining Challenge (the first edition was held at DMoLD 2013).

Note: Deadlines for both the challenge result submission and the associated papers differ from those for the regular Know@LOD 2014 papers; see the timeline below.

Prizes

  • The best result in the predictive task will be awarded a licence for RapidMiner Studio Professional Edition (catalog price approximately $3,000), thanks to the LDMC sponsor RapidMiner, Inc. (http://www.rapidminer.com/).
  • The best LDMC paper will be awarded an Amazon voucher worth 500 EUR, thanks to the LDMC sponsor, the EU project LOD2.

General Overview of the Challenge

Linked data represents a novel type of data source that has so far remained nearly untouched by advanced data mining methods. It breaks down many traditional assumptions about source data and thus poses a number of challenges:

  • While the individual published datasets typically follow a relatively regular, relational-like (or hierarchical, in the case of taxonomic classification) structure, the presence of semantic links among them makes the resulting ‘hyper-dataset’ akin to general graph datasets. On the other hand, compared to graphs such as social networks, there is a larger variety of link types in the graph.
  • The datasets have been published for entirely different purposes, such as statistical data publishing based on legal commitment of government bodies vs. publishing of encyclopedic data by internet volunteers vs. data sharing within a researcher community. This introduces further data modeling heterogeneity and uneven degree of completeness and reliability.
  • The amount and diversity of resources, as well as of their link sets, are steadily growing, which allows new linked datasets to be included in the mining dataset nearly on the fly; at the same time, however, this makes the feature selection problem extremely hard.

The Linked Data Mining Challenge (LDMC) will consist of two tracks, each with a different domain and dataset. It is possible to participate in a single track or in both tracks.

Track A addresses linked government data, more specifically, the public procurement domain. Data from this domain are frequently analyzed by investigative journalists and ‘transparency watchdog’ organizations; these, however, 1) rely on interactive tools such as OLAP and spreadsheets, which are incapable of spotting hidden patterns, and 2) only deal with isolated datasets, thus ignoring the potential of interlinking with external datasets. LDMC could initiate a paradigm shift in the analytical processing of this kind of data, eventually leading to large-scale benefits for citizens. It is also likely to spur research collaboration between the Semantic Web community (represented by the linked data sub-community as its practice-oriented segment) and the Data Mining community.

Track B addresses the domain of scientific research collaboration, in particular cross-disciplinary collaboration. While collaboration between people within the same community often emerges naturally, many possible cross-disciplinary collaborations never form due to a lack of awareness of cross-boundary synergies. Finding meaningful patterns in collaborations can help reveal potential cross-disciplinary collaborations that might otherwise have remained hidden.

Each track requires the participants to download a real-world RDF dataset and accomplish at least one pre-defined task on it, using their own or publicly available data mining tools. A partial mapping to external datasets is also available, which allows further features to be extracted from the Linked Open Data cloud to augment the core dataset.
The best participant in each track will receive an award. The participants will be ranked by the LDMC evaluation panels, taking into account both the quality of the submitted LDMC paper and the prediction quality measure for the predictive task (Track A only, if addressed by the participant).
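
As an illustration of such augmentation, the minimal Python sketch below (using rdflib) dereferences owl:sameAs links found in a core dataset and collects the external rdf:type assertions as candidate features; the file name is an assumption of this example, and a real solution would cache the downloads:

from rdflib import Graph
from rdflib.namespace import OWL, RDF

# Minimal sketch, assuming the core dataset sits in a local Turtle file;
# dereferencing every link over HTTP can be slow in practice.
g = Graph()
g.parse("ldmc-core-dataset.ttl", format="turtle")

for local, remote in list(g.subject_objects(OWL.sameAs)):
    try:
        g.parse(remote)  # pull the linked external resource into the graph
    except Exception:
        continue         # skip links that fail to dereference

# External rdf:type assertions are now available as candidate features.
for local, remote in g.subject_objects(OWL.sameAs):
    for t in g.objects(remote, RDF.type):
        print(local, "has external type", t)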

Track A Overview

The dataset comes from US public contracts sources (details below), RDFized within the LOD2 project. The two tasks are characterized as follows:

  • Task A1 concerns the prediction of the number of tenders for the respective public contract; the true value of this target attribute will only be known after the bidding period has closed (i.e., after the LDMC result submission deadline).
  • Task A2 amounts to unrestricted discovery of interesting nuggets of any sort in the (augmented) dataset. The results will be evaluated for interestingness and novelty by a panel of experts knowledgeable in the field; the most valuable results are those with the potential to ignite discussions on transparency or on unexpected economic consequences of certain procurement segments.

For both tasks, the participants will submit a paper describing the methods and techniques used and summarizing the results. In addition, for Task A1 they will submit the predicted results on evaluation data, in a prescribed format.
Given the late availability of "ground-truth" results for Task A1 and the complexity of a full-blown panel-based evaluation, the mere acceptance of LDMC papers to the Know@LOD program and proceedings will be decided via a rapid and relatively lenient review process. However, the space allocated to the individual presentations will be adjusted according to the complete evaluation results. The complete evaluation results for both tracks will be summarized in overview papers that will also be part of the Know@LOD proceedings.

Track B Overview

The dataset comes from different research metadata sources in Australia. It contains data about researchers and the datasets they produced; the latter are organized by the geographic region in which they were collected and the time when they were created. The most relevant elements of the Task B dataset are:

  • A class structure of geospatial locations by zone, where membership is inferred from the longitude and latitude coordinates in the metadata (see the sketch after this list)
  • A class structure of temporal locations, where membership is inferred from the year of data collection, based on the metadata start and end dates
  • Subscription to the Australian and New Zealand Research Classification (ANZRC) FoR and SEO codes ontology, which provides a rich classification of research areas
  • A coarser classification into STEM (science, technology, engineering, and mathematics) and HASS (Humanities and Social Sciences) research, based on the FoR codes
  • Data description
  • Researcher details
  • Project details, which are used to link researchers through transitive relationships based on shared projects
  • Additional keywords
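
As a rough illustration of the membership inference described above, consider the following Python sketch; the class names and the exact boundary are assumptions, with the Tropic of Capricorn (about 23.44 degrees south) taken as the split between Australia's tropical and temperate zones:

# Minimal sketch of zone and temporal membership inference; the class
# names and the boundary latitude are assumptions of this example.
TROPIC_OF_CAPRICORN = -23.44  # degrees; negative = southern hemisphere

def climate_zone(latitude: float) -> str:
    """Assign a dataset's collection site to a coarse climate zone."""
    return "Tropics" if latitude >= TROPIC_OF_CAPRICORN else "TemperateZone"

def collection_years(start_year: int, end_year: int) -> list:
    """Temporal classes: one membership per year of data collection."""
    return list(range(start_year, end_year + 1))

print(climate_zone(-19.26))          # e.g. Townsville -> Tropics
print(collection_years(2008, 2011))  # [2008, 2009, 2010, 2011]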

Task B is designed entirely as an unrestricted task of discovering interesting knowledge nuggets, just like Task A2. Examples of interesting findings might include, but are not limited to:

  • analyzing topical changes over time, and discovering hot topics
  • analyzing differences in dataset coverage between the two climate regions of Australia covered in the dataset (tropics and temperate zone)
  • finding general outliers, such as datasets or researchers with exceptional properties

Source Data, Required Results, and Evaluation

Track A, Task A1
The dataset is available for download here: training data, test data. It consists of training data for learning the predictive models (this data contains the value of the target attribute) and testing data for evaluating the models (this data is without the target attribute). The target attribute to be predicted is the value of the property pc:numberOfTenders (the pc prefix referring to the Public Contracts Ontology).
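
As an illustration, the target attribute can be read off the training data with a few lines of Python using rdflib; the file name and serialization below are assumptions of this example, while the pc namespace is that of the Public Contracts Ontology:

from rdflib import Graph, Namespace

# Namespace of the Public Contracts Ontology; the training file name and
# its serialization format are assumptions of this example.
PC = Namespace("http://purl.org/procurement/public-contracts#")

g = Graph()
g.parse("ldmc-track-a-training.ttl", format="turtle")

# Collect the target attribute per contract: {contract URI: number of tenders}
targets = {
    str(contract): int(value)
    for contract, value in g.subject_objects(PC.numberOfTenders)
}
print(len(targets), "training contracts with a known number of tenders")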

A detailed description of Track A data (referring to its original sources, additional linked data, used vocabularies, size, and licence information) is available here.

The participants have to submit the results achieved on the testing data, i.e., the predicted numbers of tenders. The results have to be delivered in a (syntactically correct) CSV format with two columns:

  • The first column contains the URI of the annotated public contract (an instance of pc:Contract).
  • The second column contains the predicted number of tenders for that contract, as a positive integer.

A sample CSV result file for Task A1, with a header and one content row, follows:

contract,numberOfTenders
"http://linked.opendata.cz/resource/domain/fbo.gov/contract/AG-02NV-S-14-7000",13 

The principal evaluation measure at the level of an individual object will be a specific kind of error rate, calculated as the absolute value of the difference between the predicted value and the reference value, scaled by the reciprocal of the smaller of the two values and normalized to [0, 1] by a sigmoidal function.
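
One plausible formalization of this verbal description, with p the predicted and r the reference number of tenders (the exact sigmoid used by the organizers is an assumption here), is

err(p, r) = \frac{2}{1 + e^{-|p - r| / \min(p, r)}} - 1

where the division by \min(p, r) realizes the "reciprocal of the smaller value" adjustment and the shifted logistic maps the unbounded ratio into [0, 1]. In Python:

from math import exp

def ldmc_error(predicted: int, reference: int) -> float:
    """One plausible reading of the LDMC error measure (an assumption);
    both arguments are expected to be positive integers."""
    x = abs(predicted - reference) / min(predicted, reference)
    return 2.0 / (1.0 + exp(-x)) - 1.0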

Besides the CSV file containing the predictions, the participants have to submit a paper describing the methods and techniques used. If the same participant also addresses Task A2, there should be a single paper covering both tasks (internally divided into separate content sections).

Track A, Task A2
The dataset is the same as the training dataset for Task A1. A detailed description of Track A data (referring to its original sources, additional linked data, used vocabularies, size, and licence information) is available here.
The participants only have to submit a paper describing the methods and techniques used, as well as the results obtained, i.e., the hypotheses perceived as interesting either by the computational method or by the participants themselves.
The paper will be evaluated by the evaluation panel. It should meet the standard norms of scientific writing.

Track B
The dataset is available here. For convenience, all data collected from the different sources has been fused into a single file, except for the ANZRC codes, which are imported from the web of data using an owl:import statement. The details on dataset construction can be found in this paper.
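
Since rdflib, for instance, does not follow owl:import statements automatically, the imported code lists can be pulled in manually; the file name below is an assumption of this example:

from rdflib import Graph
from rdflib.namespace import OWL

# Load the fused Track B file and resolve its owl:imports by hand;
# the file name is an assumption of this example.
g = Graph()
g.parse("ldmc-track-b.owl")

for ontology in list(g.objects(None, OWL.imports)):
    g.parse(ontology)  # fetch the imported codes ontology from the web of data

print(len(g), "triples after resolving imports")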

The participants are expected to submit a paper describing the methods and techniques used, as well as the results obtained, i.e., the hypotheses perceived as interesting either by the computational method or by the participants themselves.
The paper will be evaluated by the evaluation panel, both with respect to the soundness and originality of the methods used and with respect to the validity of the hypotheses and nuggets found. It should meet the standard norms of scientific writing.

Submission Procedure

Both results and papers will be submitted to EasyChair. (Note that the EasyChair instance is different from that for the standard Know@LOD papers!) The following rules apply:

  • The predicted results are submitted under the EasyChair topic Results A1
  • The papers are submitted either under the EasyChair topic Papers A or Papers B (depending on the track addressed)
  • If a paper refers to the previously submitted predicted results, the list of authors must be the same for both submissions, and the abstract of the paper must contain the submission number of the previous predicted results submission.

The papers have to be formatted according to the LNCS guidelines and have at most

  • 6 pages if only one task is addressed
  • 8 pages if both tasks of Track A are addressed.

Note that submissions related to different tracks (A and B) have to be made separately! However, if similar methods are used for both tracks, it is acceptable to copy the respective sections into both papers; to ease the evaluation, each paper should be self-contained and understandable without reference to the paper in the other track.

For any questions related to the submission procedure, please address the contact persons below.

Timeline

  • 26 February 2014: Data for both LDMC tracks available for download
  • 31 March 2014: Submission deadline for predictive task results
  • 3 April 2014: Submission deadline for LDMC papers
  • 10 April 2014: Notification of acceptance for LDMC papers
  • 15 April 2014: Camera-ready versions of papers
  • 15 May 2014: Complete evaluation results (both quantitative and panel-based) published, and LDMC session schedule finalized
  • 25 May 2014: LDMC session held as part of Know@LOD

Organization

Contact persons
Track A:

  • Vojtěch Svátek, University of Economics, Prague, svatek (at) vse.cz  
  • Jindřich Mynarz, University of Economics, Prague, jindrich.mynarz (at) vse.cz

Track B:

  • Heiko Paulheim, University of Mannheim, Germany, heiko (at) informatik.uni-mannheim.de


Evaluation panels
Track A:

  • Martin Nečaský, Charles University Prague, Czech Republic
  • Anders Pedersen, Open Knowledge Foundation, UK
  • Jose María Alvarez-Rodríguez, Carlos III University of Madrid, Spain
  • Jiří Skuhrovec, zIndex, Czech Republic
  • Krzysztof Wecel, Poznań University of Economics, Poland


Track B:

  • Trina Myers, James Cook University, Australia
  • Sören Auer, Fraunhofer Institute, Bonn, Germany