摘要：In Canonical Workflow Framework for Research (CWFR) “packages” are relevant in two different directions. In data science, workflows are in general being executed on a set of files which have been aggregated for specific purposes, such as for training a model in deep learning. We call this type of “package” a data collection and its aggregation and metadata description is motivated by research interests. The other type of “packages” relevant for CWFR are supposed to represent workflows in a self-describing and self-contained way for later execution. In this paper, we will review different packaging technologies and investigate their usability in the context of CWFR. For this purpose, we draw on an exemplary use case and show how packaging technologies can support its realization. We conclude that packaging technologies of different flavors help on providing inputs and outputs for workflow steps in a machine-readable way, as well as on representing a workflow and all its artifacts in a self-describing and self-contained way
摘要：The FAIR principles have been accepted globally as guidelines for improving data-driven science and data management practices, yet the incentives for researchers to change their practices are presently weak. In addition, data-driven science has been slow to embrace workflow technology despite clear evidence of recurring practices. To overcome these challenges, the Canonical Workflow Frameworks for Research (CWFR) initiative suggests a large-scale introduction of self-documenting workflow scripts to automate recurring processes or fragments thereof. This standardised approach, with FAIR Digital Objects as anchors, will be a significant milestone in the transition to FAIR data without adding additional load onto the researchers who stand to benefit most from it. This paper describes the CWFR approach and the activities of the CWFR initiative over the course of the last year or so, highlights several projects that hold promise for the CWFR approaches, including Galaxy, Jupyter Notebook, and RO Crate, and concludes with an assessment of the state of the field and the challenges ahead.
摘要：There is a huge gap between (1) the state of workflow technology on the one hand and the practices in the many labs working with data driven methods on the other and (2) the awareness of the FAIR principles and the lack of changes in practices during the last 5 years. The CWFR concept has been defined which is meant to combine these two intentions, increasing the use of workflow technology and improving FAIR compliance. In the study described in this paper we indicate how this could be applied to machine learning which is now used by almost all research disciplines with the well-known effects of a huge lack of repeatability and reproducibility. Researchers will only change practices if they can work efficiently and are not loaded with additional tasks. A comprehensive CWFR framework would be an umbrella for all steps that need to be carried out to do machine learning on selected data collections and immediately create a comprehensive and FAIR compliant documentation. The researcher is guided by such a framework and information once entered can easily be shared and reused. The many iterations normally required in machine learning can be dealt with efficiently using CWFR methods. Libraries of components that can be easily orchestrated using FAIR Digital Objects as a common entity to document all actions and to exchange information between steps without the researcher needing to understand anything about PIDs and FDO details is probably the way to increase efficiency in repeating research workflows. As the Galaxy project indicates, the availability of supporting tools will be important to let researchers use these methods. Other as the Galaxy framework suggests, however, it would be necessary to include all steps necessary for doing a machine learning task including those that require human interaction and to document all phases with the help of structured FDOs.
摘要：The overall expectation of introducing Canonical Workflow for Experimental Research and FAIR digital objects (FDOs) can be summarised as reducing the gap between workflow technology and research practices to make experimental work more efficient and improve FAIRness without adding administrative load on the researchers. In this document, we will describe, with the help of an example, how CWFR could work in detail and improve research procedures. We have chosen the example of “experiments with human subjects” which stretches from planning an experiment to storing the collected data in a repository. While we focus on experiments with human subjects, we are convinced that CWFR can be applied to many other data generation processes based on experiments. The main challenge is to identify repeating patterns in existing research practices that can be abstracted to create CWFR. In this document, we will include detailed examples from different disciplines to demonstrate that CWFR can be implemented without violating specific disciplinary or methodological requirements. We do not claim to be comprehensive in all aspects, since these examples are meant to prove the concept of CWFR.
摘要：This special issue is on Canonical Workflow Frameworks for Research (CWFR). A workflow refers to a sequence of activities, which may be more or less computer-based, used with regularity in the research process. CWFR aim to identify common patterns in such scientifically motivated workflows and to offer libraries of components based on FAIR Digital Objects as the integrative standard. Such CWFR components can be reusable independent of particular technologies, benefitting researchers in their daily work by making recurring activities more efficient, using automated workflow methods that would immediately create FAIR compliant data without adding burden.
摘要：Much research is dependent on Information and Communication Technologies (ICT). Researchers in different research domains have set up their own ICT systems (data labs) to support their research, from data collection (observation, experiment, simulation) through analysis (analytics, visualisation) to publication. However, too frequently the Digital Objects (DOs) upon which the research results are based are not curated and thus neither available for reproduction of the research nor utilization for other (e.g., multidisciplinary) research purposes. The key to curation is rich metadata recording not only a description of the DO and the conditions of its use but also the provenance – the trail of actions performed on the DO along the research workflow. There are increasing real-world requirements for multidisciplinary research. With DOs in domain#2;specific ICT systems (silos), commonly with inadequate metadata, such research is hindered. Despite wide agreement on principles for achieving FAIR (findable, accessible, interoperable, and reusable) utilization of research data, current practices fall short. FAIR DOs offer a way forward. The paradoxes, barriers and possible solutions are examined. The key is persuading the researcher to adopt best practices which implies decreasing the cost (easy to use autonomic tools) and increasing the benefit (incentives such as acknowledgement and citation) while maintaining researcher independence and flexibility.
摘要：Data Science (DS) as defined by Jim Gray is an emerging paradigm in all research areas to help finding non-obvious patterns of relevance in large distributed data collections. “Open Science by Design” (OSD), i.e., making artefacts such as data, metadata, models, and algorithms available and re-usable to peers and beyond as early as possible, is a pre-requisite for a flourishing DS landscape. However, a few major aspects can be identified hampering a fast transition: (1) The classical “Open Science by Publication” (OSP) is not sufficient any longer since it serves different functions, leads to non-acceptable delays and is associated with high curation costs. Changing data lab practices towards OSD requires more fundamental changes than OSP. 2) The classical publication-oriented models for metrics, mainly informed by citations, will not work anymore since the roles of contributors are more difficult to assess and will often change, i.e., other ways for assigning incentives and recognition need to be found. (3) The huge investments in developing DS skills and capacities by some global companies and strong countries is leading to imbalances and fears by different stakeholders hampering the acceptance of Open Science (OS). (4) Finally, OSD will depend on the availability of a global infrastructure fostering an integrated and interoperable data domain—“one data-domain” as George Strawn calls it—which is still not visible due to differences about the technological key pillars. OS therefore is a need for DS, but it will take much more time to implement it than we may have expected.
摘要：Coming from an institute that was devoted to analysing data streams of different sorts from its beginning to understand how the human brain is processing language and how language is supporting cognition, building efficient data infrastructures of different scope was a key to research excellence. While first local infrastructures were sufficient, it became apparent in the 90s that local data would not be sufficient anymore to satisfy all research needs. It was a logical step to first take responsibilities in setting up the specific DOBES (Dokumentation bedrohter Sprachen) infrastructure focussing on languages of the world, then the community#2;wide CLARIN RI (European Research Infrastructure for Language Resources and Technology) and later the cross-disciplinary EUDAT data infrastructure [1,2,3]. Realising the huge heterogeneity in data practices, it was also a logical step to start the Research Data Alliance (RDA)  as a truly bottom-up initiative to discuss harmonisation across disciplines and across borders. On this background, determined by always looking for concrete results, the European Open Science Cloud (EOSC) process had Kafka-esc characteristics to me, despite the many interactions I had with EOSC key persons and other colleagues involved. Talking at a level where the technological core remained widely absent was difficult to do for me. Due to Jean-Claude Burgelman’s (JCB) excellent paper I finally understood that excluding the discussions about the core was the only chance to get EOSC accepted. Of course, the discussions about the EOSC core would have to happen at a certain moment and obviously eternal types of disputes would determine these discussions. Therefore, the fallback on the analogy with Greek tragedies was an excellent idea by JCB.
摘要：Data-intensive science is reality in large scientific organizations such as the Max Planck Society, but due to the inefficiency of our data practices when it comes to integrating data from different sources, many projects cannot be carried out and many researchers are excluded. Since about 80% of the time in data#2;intensive projects is wasted according to surveys we need to conclude that we are not fit for the challenges that will come with the billions of smart devices producing continuous streams of data—our methods do not scale. Therefore experts worldwide are looking for strategies and methods that have a potential for the future. The first steps have been made since there is now a wide agreement from the Research Data Alliance to the FAIR principles that data should be associated with persistent identifiers (PIDs) and metadata (MD). In fact after 20 years of experience we can claim that there are trustworthy PID systems already in broad use. It is argued, however, that assigning PIDs is just the first step. If we agree to assign PIDs and also use the PID to store important relationships such as pointing to locations where the bit sequences or different metadata can be accessed, we are close to defining Digital Objects (DOs) which could indeed indicate a solution to solve some of the basic problems in data management and processing. In addition to standardizing the way we assign PIDs, metadata and other state information we could also define a Digital Object Access Protocol as a universal exchange protocol for DOs stored in repositories using different data models and data organizations. We could also associate a type with each DO and a set of operations allowed working on its content which would facilitate the way to automatic processing which has been identified as the major step for scalability in data science and data industry. A globally connected group of experts is now working on establishing testbeds for a DO-based data infrastructure.