Nathalie Lambeng

and 3 more

Abstract

Most research-intensive institutions provide some form of data management support. However, the form in which these services are offered and how extensive they are differ, which often makes them difficult to compare. Objective comparison of the different types of services is needed to evaluate the effectiveness of the diverse approaches and to make informed decisions about their usefulness. In this practice paper, we discuss a collaborative effort between Delft University of Technology (TU Delft), École Polytechnique Fédérale de Lausanne (EPFL), University of Cambridge and University of Illinois, which resulted in the development of a short survey to assist institutions in increasing the effectiveness of their data management support services and their evolution.

Different approaches to a common goal

Informal discussions between the research data service teams of TU Delft, EPFL, University of Cambridge and University of Illinois revealed that each institution had taken a different approach to designing its data support services. TU Delft has a central research data support team at the Library [1], which is also part of a consortium of four Dutch technical universities (4TU) [2]. In addition, TU Delft is embarking on a Data Stewardship project, which will provide disciplinary support for data management embedded at faculties (Teperek et al., “Data Stewardship – addressing disciplinary data management needs”, abstract submitted to IDCC18). EPFL has a central data management support team, which provides generic as well as on-demand, tailored training and data consultations to the research community [3]. This team is assisted by liaison librarians, who know the data management needs of their faculties and help the central team shape its service to meet disciplinary requirements. EPFL is also an active player in the national Digital Lifecycle Management project [4].
The University of Cambridge, in addition to a small central team supporting researchers in data management, also has a dedicated programme of Data Champions: researchers volunteering their time to advocate for good data management in their local communities [5]. The University of Illinois has a central data management support team and is also part of a national network of subject-specific data curators [6]. Despite these differences, the goal of the four service providers is the same: to improve data management practice within their research communities. How, then, can we compare how well our approaches achieve this common goal?

Evaluation of existing measures

Members of the four institutions first reviewed existing tools for assessing data management support services. We began with the Research Infrastructure Self Evaluation Framework (RISE) survey created by the Digital Curation Centre [7]. However, we thought that this framework was more suitable for assessing the maturity of the data services offered than researchers' data management practice. We then looked at the Data Asset Framework (DAF) used by several UK institutions [8]. The DAF survey is a comprehensive tool that allows institutions to assess researchers’ data management practice and identify gaps in service provision; thus, in principle, it should meet our requirements. However, the DAF survey consists of over sixty questions, which is not compatible with the repeated assessment we plan to carry out. We therefore decided to follow its general principle, but with something simpler and less resource-intensive.

Short survey on data management practice

Based on the DAF survey, we came up with a list of ten multiple-choice questions that we found essential for reflecting researchers’ data management practice. By limiting the number of questions to ten and by ensuring these were multiple choice, we aimed, first, to be respectful of researchers’ time and, secondly, to allow the results to be standardised and compared.
In addition to a commonly agreed set of questions, each institution was able to add its own specific questions to obtain more granular information about the different research units and to get feedback about specific services provided to its research communities.

Anticipated outcomes

TU Delft and EPFL will launch the survey in October 2017, followed by the University of Cambridge and the University of Illinois. We anticipate that the first comparative results will be available at the beginning of 2018. We expect that the results of the survey will provide a useful initial assessment of current data management practices across research communities, which will show institutions where the biggest gaps are and where more work is needed. The results will help us understand the different disciplinary needs and the maturity of subject-specific data management practice, thus allowing a more targeted approach. In addition, comparing the results between the institutions will hopefully highlight strengths and weaknesses of the different approaches they took in developing their data management support and lead to an exchange of best practice.

Limitations

As with any other methodology based on surveys, there are limitations to our approach, which will affect the type of conclusions that can be drawn. First, the respondents will be self-selected, and therefore may not be representative of the research communities we are trying to sample. Secondly, institutions need to be cautious when interpreting potentially different results for diverse groups of respondents, as these might not be directly related to the quality or availability of data support services and might be affected by external factors, such as community norms, specific funders’ policies, the influence of local authorities, etc.
Finally, the limited number of questions used in the survey limits the depth of possible conclusions about data management practices. Nonetheless, we believe that the benefits of our lightweight data management practice assessment make the approach worth testing.

Next steps

The initial results, expected in early 2018, will allow us to evaluate whether the survey allows for comparative assessment of data management practice. If the survey proves to be suitable for such measurements, we will continue to use it to regularly evaluate the maturity of researchers’ data management practice at our respective institutions. Additionally, we plan to share the survey under a CC BY licence to enable others to use the tool for their assessments and to allow comparisons and collaborations with other institutions.

References

1. Research Data Management. TU Delft. Available at: https://www.tudelft.nl/library/themaportalen/research-data-management/ (Accessed: 19th October 2017)
2. 4TU.ResearchData: Home. Available at: http://researchdata.4tu.nl/home/ (Accessed: 19th October 2017)
3. ResearchData | EPFL. Available at: https://researchdata.epfl.ch/ (Accessed: 19th October 2017)
4. Home :: DLCM. Available at: https://www.dlcm.ch/ (Accessed: 19th October 2017)
5. Higman, R., Teperek, M. & Kingsley, D. Creating a Community of Data Champions. bioRxiv 104661 (2017). doi:10.1101/104661
6. Johnston, L. R. et al. Data Curation Network: A Cross-Institutional Staffing Model for Curating Research Data. (2017).
7. RISE, a self-start tool for research data management service review | Digital Curation Centre. Available at: http://www.dcc.ac.uk/news/rise-self-start-tool-research-data-management-service-review (Accessed: 15th October 2017)
8. Johnson, R., Parsons, T. & Chiarelli, A. Jisc Data Asset Framework Toolkit 2016. (Zenodo, 2016). doi:10.5281/zenodo.177876

Jan Krause

and 11 more

Scope

This document is a textual sketch that aims, for each question of the SNSF-DMP, to introduce technical elements for the answer:
- in the form of one or two generic sentences that researchers can customize,
- or in the form of several generic elements that can be selected by researchers depending on their projects.

Responsibility: Aude.

Swiss National Science Foundation Data Management Plan: DMP template for the mySNF form

This template is based on the SNF guidelines http://www.snf.ch/SiteCollectionDocuments/DMP_content_mySNF-form_en.pdf and several concrete examples of DMPs, in particular those available on the DCC and UNIGE websites:
http://www.dcc.ac.uk/resources/data-management-plans/guidance-examples
https://www.unige.ch/researchdata/en/make-plan/all/dmp-fns/
http://www.gla.ac.uk/media/media_418166_en.pd
http://www.gla.ac.uk/media/media_441723_en.pdf

1. Data collection and documentation

1.1 What data will you collect, observe, generate or re-use?

Questions you might want to consider:
- What type, format and volume of data will you collect, observe, generate or reuse?
- Which existing data (yours or third-party) will you reuse?

Briefly describe the data you will collect, observe or generate. Also mention any existing data that will be (re)used. The descriptions should include the type, format and content of each dataset. Furthermore, provide an estimate of the volume of the generated datasets.
(This relates to the FAIR Data Principles F2, I3, R1 & R1.2)

Examples of answer

The data produced from this research project will fall into two categories:
1. The various reaction parameters required for optimisation of the chemical transformation.
2. The spectroscopic and general characterisation data of all compounds produced during the work.
We anticipate that the data produced in category 1 will amount to approximately 10 MB and the data produced in category 2 will be in the range of 4 - 5 GB.

***

This project will work with and generate three main types of raw data.
1. Images from transmitted-light microscopy of Giemsa-stained squashed larval brains.
2. Images from confocal microscopy of immunostained whole-mounted larval brains.
3. Western blot data.
All data will be stored in digital form, either in the format in which it was originally generated (i.e. Metamorph files for confocal images; Spectrum Mill files for mass spectra, with results of mass spectra analyses stored in Excel files; tiff files for gel images; Filemaker Pro files for genetics records), or will be converted into a digital form via scanning to create tiff or jpeg files (e.g. western blots or other types of results).
Measurements and quantification of the images will be recorded in spreadsheets. Micrograph data is expected to total between 100 GB and 1 TB over the course of the project. Scanned images of western blots are expected to total around 1 GB over the course of the project. Other derived data (measurements and quantifications) are not expected to exceed 10 MB.

1.2 How will the data be collected, observed or generated?

Questions you might want to consider:
- What standards, methodologies or quality assurance processes will you use?
- How will you organize your files and handle versioning?

Explain how the data will be collected, observed or generated. Describe how you plan to control and document the consistency and quality of the collected data: calibration processes, repeated measurements, data recording standards, usage of controlled vocabularies, data entry validation, data peer review, etc.
Discuss how data management will be handled during the project, mentioning for example naming conventions, version control and folder structures.
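As an illustration of the naming-convention and versioning guidance above, here is a minimal sketch in Python. It is not part of the SNSF guidelines: the pattern `<project>_<YYYYMMDD>_<description>_v<NN>.<ext>` and the helper names `build_name`, `is_valid_name` and `next_version` are assumptions chosen for the example, and each project should substitute its own pre-agreed convention.

```python
import re
from datetime import date

# Hypothetical convention (an assumption for this sketch):
#   <project>_<YYYYMMDD>_<description>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<date>\d{8})_(?P<description>[a-z0-9-]+)"
    r"_v(?P<version>\d{2,})\.(?P<ext>[a-z0-9]+)$"
)


def build_name(project, day, description, version, ext):
    """Assemble a file name that follows the assumed convention."""
    # Lower-case the description and replace anything that is not a
    # letter or digit with a hyphen, so names stay portable.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{project.lower()}_{day:%Y%m%d}_{slug}_v{version:02d}.{ext.lower()}"


def is_valid_name(name):
    """Return True if the file name matches the assumed convention."""
    return NAME_PATTERN.match(name) is not None


def next_version(name):
    """Return the same file name with its version number increased by one."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"File name does not follow the convention: {name}")
    version = int(match["version"]) + 1
    return (
        f"{match['project']}_{match['date']}_{match['description']}"
        f"_v{version:02d}.{match['ext']}"
    )


if __name__ == "__main__":
    first = build_name("nmr", date(2017, 10, 19), "Reaction Screen 3", 1, "csv")
    print(first)
    print(next_version(first))
    print(is_valid_name("final_FINAL2.xlsx"))
```

A check like `is_valid_name` could be run over a project folder before deposit to flag files that drifted from the convention; the convention itself would still need to be described in the README.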
(This relates to the FAIR Data Principle R1)

Examples of answer

The reaction conditions will be recorded and collated using Excel spreadsheets and named according to each generation of reaction.
The various experimental procedures and associated compound characterisation will be written up using the Royal Society of Chemistry standard formatting in a Word document. The associated NMR spectra will be collated in chronological order in a .pdf document.
These are standard practices for synthetic methodology projects.

***

All samples on which data are collected will be prepared according to published standard protocols in the field. Files will be named according to a pre-agreed convention. The dataset will be accompanied by a README file which will describe the directory hierarchy and file naming convention.
Each directory will contain an INFO.txt file describing the experimental protocol used in that experiment. It will also record any deviations from the protocol and other useful contextual information.
Microscope images capture and store a range of metadata (field size, magnification, lens phase, zoom, gain, pinhole diameter etc.) with each image.
This should allow the data to be understood by other members of our research group and add contextual value to the dataset should it be reused in the future.

1.3 What documentation and metadata will you provide with the data?

Questions you might want to consider:
- What information is required for users (computer or human) to read and interpret the data in the future?
- How will you generate this documentation?
- What community standards (if any) will be used to annotate the (meta)data?

Describe all types of documentation (README files, metadata, etc.) you will provide to help secondary users understand and reuse your data. Metadata should at least include basic details allowing other users (computer or human) to find the data.
This includes at least a name and a persistent identifier for each file, the name of the person who collected or contributed to the data, the date of collection and the conditions to access the data.
Furthermore, the documentation may include details on the methodology used, information about the processing and analytical steps performed, variable definitions, references to vocabularies used, as well as units of measurement.
Wherever possible, the documentation should follow existing community standards and guidelines. Explain how you will prepare and share this information.
(This relates to the FAIR Data Principles I1, I2, I3, R1, R1.2 & R1.3)

Examples of answer

The data will be accompanied by the following contextual documentation, according to standard practice for synthetic methodology projects:
1. Spreadsheet documents which detail the reaction conditions.
2. Text files which detail the experimental procedures and compound characterisation.
Files and folders will be named according to a pre-agreed convention. The final dataset as deposited in the institutional data repository will also be accompanied by a README file listing the contents of the other files and outlining the file-naming convention used.

***

Metadata will be tagged in XML using the Data Documentation Initiative (DDI) format. The codebook will contain information on study design, sampling methodology, fieldwork, variable-level detail, and all information necessary for a secondary analyst to use the data accurately and effectively.
It will be the responsibility of each researcher to annotate their data with metadata, and it will be the responsibility of the Principal Investigator to check weekly (during the field season, monthly otherwise) with all participants to ensure data is being properly processed, documented, and stored.
All the datasets produced by the project will be published under a GNU licence.

2. Ethics, legal and security issues

2.1 How will ethical issues be addressed and handled?

Questions you might want to consider:
- What is the relevant protection standard for your data? Are you bound by a confidentiality agreement?
- Do you have the necessary permission to obtain, process, preserve and share the data? Have the people whose data you are using been informed or did they give their consent?
- What methods will you use to ensure the protection of personal or other sensitive data?

Ethical issues in research projects call for adapting research data management practices, e.g. how data is stored, who can access or reuse the data and how long the data is stored. Methods to manage ethical concerns may include anonymization of data, approval by ethics committees and formal consent agreements. You should show that all ethical issues in your project have been identified, together with the corresponding data management measures.
(This relates to the FAIR Data Principle A1)

Examples of answer

There are no ethical issues in the generation of results from a synthetic methodology project. No human subjects or samples are involved.

***

This project will generate data designed to study the prevalence and correlates of DSM III-R psychiatric disorders and patterns and correlates of service utilization for these disorders in a nationally representative sample of over 8000 respondents.
The sensitive nature of these data will require that the data be released through a restricted use contract.

2.2 How will data access and security be managed?

Questions you might want to consider:
- What are the main concerns regarding data security, what are the levels of risk and what measures are in place to handle security risks?
- How will you regulate data access rights/permissions to ensure the security of the data?
- How will personal or other sensitive data be handled to ensure safe data storage and transfer?

If you work with personal or other sensitive data, you should outline the security measures taken to protect the data. Please list the formal standards which will be adopted in your study. An example is ISO 27001 (Information security management). Furthermore, describe the main processes or facilities for storage and processing of personal or other sensitive data.
(This relates to the FAIR Data Principle A1)

Examples of answer

The data will be processed and managed in a secure non-networked environment using virtual desktop technology.

***

All interviewees and focus group participants will sign a consent form agreed to by the School ethics committee. We have guaranteed anonymity to our interviewees and focus group participants. Therefore we will not be depositing .wav files, as this would compromise that guarantee. However, anonymised transcripts of the interviews and focus groups will be deposited. We will make sure consent forms make provision for future sharing of data.
All identifying information will be kept in a locked filing cabinet and not stored with electronic files.

2.3 How will you handle copyright and Intellectual Property Rights issues?

Questions you might want to consider:
- Who will be the owner of the data?
- Which licenses will be applied to the data?
- What restrictions apply to the reuse of third-party data?

State who owns the copyright and Intellectual Property Rights (IPR) of all data that will be collected and generated, including the licence(s). For consortia, an IPR ownership agreement might be necessary. You should comply with relevant funder, institutional, departmental or group policies on copyright or IPR. Furthermore, clarify what permissions are required should third-party data be re-used.
(This relates to the FAIR Data Principles I3 & R1.1)

Examples of answer

The research is not expected to lead to patents. IPR issues will be dealt with in line with University of Glasgow policy and necessary guidance will be sought from the IPR and Commercialisation Team.

***

This project is being carried out in collaboration with an industrial partner. The intellectual property rights are set out in the collaboration agreement. The intellectual property generated from this project will be fully exploited with help from the University of Glasgow's IP and Commercialisation Office. The aim is to patent the final procedure and then publish the work in a research journal.

3. Data storage and preservation

3.1 How will your data be stored and backed-up during the research?

Questions you might want to consider:
- What is your storage capacity and where will the data be stored?
- What are the back-up procedures?

Please mention what the needs are in terms of data storage and where the data will be stored.
Please consider that data storage on laptops or hard drives, for example, is risky. Storage through IT teams is safer.
If external services are requested, it is important that this does not conflict with the policy of each entity involved in the project, especially concerning the issue of sensitive data.
Please specify your back-up procedure (frequency of updates, responsibilities, automatic/manual process, security measures, etc.)

Examples of answer

Storage and back-up will be in three places:
● On the laptop of [Name of Researcher]
● On a portable storage device (hard drive)
● On the University of Glasgow server.
[Name of Researcher] will be responsible for the storage and back-up of data. This will be done weekly.
Both the laptop and the external storage device will be password protected. The risks are that the computer will be hacked and the external drive stolen. The laptop has anti-virus software installed which is updated daily. The external device is kept in a locked cabinet. Moreover, the data files will also be password protected. The paper consent forms from interviewees/focus group participants will be kept in a locked cabinet in the university office of [Name of Researcher].

***

Original notebooks and hardcopies of all NMR and mass spectra are stored in the PI’s laboratory. Additional electronic data will be stored on the PI’s computer, which is backed up daily. Additionally, the laboratory will make use of the PI’s lab server space at Tufts for a second repository of data storage. The PI’s lab has access to up to 1 terabyte of information storage at Tufts, which can be expanded if needed.
All the project data will be stored using Tufts University Information Technology resources.
The storage is backed up to LTO-4 tape on a daily and weekly basis and stored offsite at Iron Mountain facilities.

3.2 What is your data preservation plan?

Questions you might want to consider:
- What procedures would be used to select data to be preserved?
- What file formats will be used for preservation?

Please specify which data will be retained, shared and archived after the completion of the project and the corresponding data selection procedure (e.g. long-term value, potential value for re-use, obligations to destroy some data, etc.). Please outline a long-term preservation plan for the datasets beyond the lifetime of the project.
In particular, comment on the choice of file formats and the use of community standards.

Examples of answer

Data will be stored for a minimum of three years beyond the award period, per NSF guidelines. If inventions or new technologies are made in connection with the data, access to the data will be restricted until invention disclosures and/or provisional patent filings are made with the EPFL Technology Transfer Office (TTO).

***

We will preserve the data for 10 years on university servers and also deposit the data in a UK data archive. We will deposit our data at the end of the project in an appropriate data depository, e.g. UK Data Archive and/or the Consortium of European Social Science Data Archives. There is no additional cost. Where possible we will store files in open archival formats, e.g. Word files converted to .txt files and Excel files converted to .csv. Where this is not possible we will include information on the software used and its version number.

4. Data sharing and reuse

4.1 How and where will the data be shared?

Questions you might want to consider:
- On which repository do you plan to share your data?
- How will potential users find out about your data?

Consider how and on which repository the data will be made available.
The methods applied to data sharing will depend on several factors such as the type, size, complexity and sensitivity of data.
Please also consider how the reuse of your data will be valued and acknowledged by other researchers.
(This relates to the FAIR Data Principles F1, F3, F4, A1, A1.1, A1.2 & A2)

Examples of answer

Some of the ongoing data will be shared on [Researcher1]’s GitHub page (results and code from NetLogo, data from Twitter searches).
For all other data we will use the University of Glasgow Institutional Data Repository, 'Enlighten: Research Data', where we can direct interested parties, rather than handling each request individually.

***

Datasets from this work which underpin a publication will be deposited in Enlighten: Research Data, the University of Glasgow’s institutional data repository, and made public at the time of publication. Data in the repository will be stored in accordance with funder and University data policies. Files deposited in Enlighten: Research Data will be given a Digital Object Identifier (DOI) and the associated metadata will be listed in the University of Glasgow Research Data Registry and the DataCite metadata store. The retention schedule for data in Enlighten: Research Data will be 10 years from date of deposition in the first instance, with extensions applied to datasets which are subsequently accessed.
This complies with both University of Glasgow guidance and funder policies.
Enlighten: Research Data is backed by commercial digital storage which is audited twice a year for compliance with the ISO27001 Information Security Management standard.
The DOI issued to datasets in the repository can be included as part of a data citation in publications, allowing the datasets underpinning a publication to be identified and accessed.
Metadata about datasets held in the University Registry will be publicly searchable and discoverable and will indicate how and on what terms the dataset can be accessed.

4.2 Are there any necessary limitations to protect sensitive data?

Questions you might want to consider:
- Under which conditions will the data be made available (timing of data release, reason for delay if applicable)?

Data have to be shared as soon as possible, and at the latest at the time of publication of the respective scientific output.
Restrictions may only be due to legal, ethical, copyright, confidentiality or other clauses.
Consider whether a non-disclosure agreement would give sufficient protection for confidential data.
(This relates to the FAIR Data Principles A1 & R1.1)

Examples of answer

Data which underpins any publication will be made available at the time of publication.
All unpublished data will be deposited in a data repository 12 months after the end of the award.

***

Astronomical data will be released, but under an embargo of one year to preserve priority of exploitation.

***

Personal data will be anonymized before release, based on the recommendations of the Commission nationale de l'informatique et des libertés (CNIL).

***

Data will be made available under the Creative Commons licence CC-BY.
For more licensing options, please feel free to explore this website: https://creativecommons.org/share-your-work/