Awaiting Activation edited bulk upload.md almost 10 years ago

Commit id: bbb07a85e6967ab30c4c8bafba419a30a1cc7615

[online](https://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0Ai_PDCcY5g2JdFN1UDJJdjNsZk9RM0Z6bnFDdlQ0clE&output=html) and can be downloaded in .xls format [spreadsheet](https://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0Ai_PDCcY5g2JdFN1UDJJdjNsZk9RM0Z6bnFDdlQ0clE&output=xls).

There are three phases for a basic bulk upload of data:

1. Use the web interface
    * to enter metadata (new sites, citations, treatments, managements)
    * to obtain a template appropriate for your data set.
2. Fill in the template with your data.
3. Use the web interface to upload your data set and insert it into the database.

For now, to keep things simple, only the steps needed to upload yields data are outlined.

For clarity, in what follows, the term "field" will be used to refer to a heading in the uploaded CSV file, and the term "column" or "attribute" to refer to an attribute of a yield datum in the yields table of the database.

* **Required fields**:
    1. citation
        * If only one citation exists for the entire dataset, it may be specified interactively by choosing a citation on the citations page.
        * Otherwise, specify it in the CSV table using either (doi) or (author, year, first_n_characters_of(title)) for some number n; or perhaps (author, year, first_3_words_of(title)).
    2. site: use sitename
    3. species: use scientificname
    4. treatment: use name
    5. date: must take one of the forms "2003-07-25", "2003-07", or "2003"
    6. mean: must be one of the fields of the CSV table, though for uploads of yields data this field is called "yield" by default in the provided templates
    7. n: required if and only if an SE field is given
    8. SE: required if and only if an n field is given; this datum will be inserted into the stat column, and the statname will be set to "SE"
    9. access_level

Of these, the citation, site, species, treatment, and access_level may be specified interactively when uploading the dataset (if they are uniform for the whole set) rather than appearing as a field or set of fields of the CSV file. As noted above, for citations, this is done outside of the upload wizard by choosing a citation on the citations page.

* **Optional fields**:
    1. n and SE: as noted above, if one of these is present, the other must be as well; if SE is given, the value will go into the stat column of the yields table, and the statname column will be set to "SE"
    2. cultivar: use name; defaults to NULL (for the cultivar_id column) if not provided
    3. notes: defaults to the empty string if not provided

If a uniform value for the species is provided interactively when uploading the data set, the cultivar may be specified this way as well, provided that it also has a uniform value for the whole data set.

If n and SE are not given as fields of the uploaded CSV file, the value of the n column of the yields table will default to 1, and the stat and statname column values will default to NULL.

**Automatically set attributes**:

* id: automatically generated by the DBMS
* dateloc: default based on the format of the date given in the date field (alternative: select value via dropdown for entire dataset)
* created_at: always set to NOW
* updated_at: always set to NOW
* user_id: always set to the currently-logged-in user's id
* checked: always set to 0
* method_id: always set to NULL

**Other interactive options**:

The user may set the number of significant digits to round to when inserting data into the "mean" and the "stat" columns (which come from the "yield" and "SE" fields of the CSV file, respectively). The default value is 4 significant digits, but the user may choose any value from 1 to 4.
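The field rules above (the three accepted date forms, the n/SE pairing, the defaults, and rounding to significant digits) can be sketched in Ruby. This is only an illustration, not the application's code: `date_precision`, `round_sig`, and `normalize_row` are hypothetical names, the row is assumed to be a hash keyed by CSV field name with any interactively supplied values already merged in, and the translation from date precision to an actual dateloc value is left out because the text does not specify it.

```ruby
# Accepted date forms, mapped to the precision each one implies.
DATE_FORMS = {
  /\A\d{4}-\d{2}-\d{2}\z/ => :day,    # e.g. "2003-07-25"
  /\A\d{4}-\d{2}\z/       => :month,  # e.g. "2003-07"
  /\A\d{4}\z/             => :year    # e.g. "2003"
}.freeze

# Returns :day, :month, or :year; raises if the date has another form.
def date_precision(date)
  DATE_FORMS.each { |pattern, precision| return precision if date =~ pattern }
  raise ArgumentError, "unsupported date form: #{date.inspect}"
end

# Rounds a number to `digits` significant digits (1 to 4, default 4).
def round_sig(value, digits = 4)
  raise ArgumentError, 'digits must be between 1 and 4' unless (1..4).cover?(digits)
  return 0.0 if value.zero?

  magnitude = Math.log10(value.abs).floor
  value.round(digits - 1 - magnitude).to_f
end

# Turns one CSV row (a hash keyed by field name) into yields-table values.
def normalize_row(row, digits: 4)
  has_n  = row.key?('n')
  has_se = row.key?('SE')
  raise ArgumentError, 'n and SE must be given together' if has_n != has_se

  {
    mean:           round_sig(Float(row.fetch('yield')), digits),
    n:              has_n ? Integer(row['n'], 10) : 1,          # default n is 1
    stat:           has_se ? round_sig(Float(row['SE']), digits) : nil,
    statname:       has_se ? 'SE' : nil,
    date_precision: date_precision(row.fetch('date')),
    notes:          row.fetch('notes', '')                      # default is ''
  }
end
```

For example, `normalize_row({ 'yield' => '12.3456', 'date' => '2003-07' })` produces a mean of 12.35, an n of 1, and NULL-like (`nil`) stat and statname values, while a row that supplies n without SE is rejected.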
Note that the database does not give any indication of how many digits are significant.

**Outline of template-download wizard**

**Possible templates**

Here are two possible templates (given here as lists of field names) that a user wishing to upload yield data may use. If a value is set for the entire dataset, the corresponding fields are not included in the template.

`<citation_fields>,site,species,treatment,cultivar,date,yield,n,SE,notes,access_level`

where `<citation_fields>` may be either `citation_doi` or `citation_author,citation_year,citation_title`.

All templates provided will use some subset (in rare cases, the full set) of these field names.

**Wizard steps**

1. Do all of the data in your data set pertain to a single citation? If yes, skip to 3.
2. Do you have DOI values for all of your citations?
3. Do all of the data in your data set pertain to a single site?
4. Do all of the data in your data set pertain to a single species?
5. Did you use the same treatment for each datum in your data set?
6. Should all of the data in your data set have the same level of access?
7. Does your data include cultivar information? If no, or if the answer to question 4 was no, skip to 9.
8. Do all of the data in your data set pertain to a single cultivar?
9. Is there a single date for all of the data in your data set?
10. Was the sample size greater than 1 for your data points?
11. Do you have notes to include with your data points?
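The skip rules in the wizard steps above (question 1 may jump to 3, question 7 may jump to 9) can be sketched as a small state function. This is a minimal sketch under assumptions: `next_question` and `walk` are hypothetical names, and answers are modeled as `true` for yes and `false` for no.

```ruby
# Returns the number of the next question to ask, or nil after question 11.
# `answers` maps the numbers of questions already asked to true (yes) or false (no).
def next_question(current, answers)
  # Question 1: a single citation means the DOI question is skipped.
  return 3 if current == 1 && answers[1]
  # Question 7: no cultivar data, or more than one species, skips question 8.
  return 9 if current == 7 && (!answers[7] || answers[4] == false)

  current < 11 ? current + 1 : nil
end

# Runs the whole wizard, asking `answer_for` (any callable) for each answer.
def walk(answer_for)
  answers = {}
  question = 1
  while question
    answers[question] = answer_for.call(question)
    question = next_question(question, answers)
  end
  answers
end
```

Calling `walk(->(q) { true })` asks every question except 2, and `walk(->(q) { false })` asks every question except 8.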
Based on the answers to these questions, we build the field list as follows:

    field_list = ['yield']
    If 1 = no
        if 2 = yes
            field_list << 'citation_doi'
        else
            field_list << ['citation_author,citation_year,citation_title']
            [add instructions about only requiring the beginning of the title]
    If 3 = no
        field_list << ['site']
    If 4 = no
        field_list << ['species']
    If 5 = no
        field_list << ['treatment']
    If 6 = no
        field_list << ['access_level']
    If 7 = yes and 8 = no
        field_list << ['cultivar']
    If 9 = no
        field_list << ['date']
    If 10 = yes
        field_list << ['n', 'SE']
    If 11 = yes
        field_list << ['notes']

**To Dos**

* outline what instructions should be provided with each template (if any)
* outline the upload phase. In particular:
    * the steps of the upload wizard
    * add citation, site, treatment, management, covariate, method as needed
    * what instructions should be provided and where
    * what validation will be done
    * how interactively-provided data should be entered
* outline modifications needed for trait uploads
    * fields in traits table that are not in yields table: variable_id, date_year, date_month, date_day, time, timeloc, time_hour, time_minute, entity_id

For trait uploads, the variable (identified by name and units) must be provided in the CSV file or (if a single value applies to the whole data set) interactively during the upload process.

Contact [David LeBauer](mailto:[email protected]) or [Mike Dietze](mailto:[email protected]) for more information about using this method of data upload.
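The field-list rules given earlier in this section can be turned into runnable Ruby along the following lines. This is a sketch under stated assumptions, not the wizard's actual code: `build_field_list`, `template_header`, and `TEMPLATE_ORDER` are hypothetical names; answers are a hash from question number to `true` (yes) or `false` (no); a skipped question 8 is treated as "no", which is one reading of the skip rule on question 7; and ordering the header to match the template shown under "Possible templates" is an added assumption.

```ruby
# Field order as it appears in the full template.
TEMPLATE_ORDER = %w[
  citation_doi citation_author citation_year citation_title
  site species treatment cultivar date yield n SE notes access_level
].freeze

# Builds the list of CSV fields from the wizard answers.
def build_field_list(answers)
  fields = ['yield']
  unless answers[1]
    # Several citations: identify each by DOI, or by author, year, and title.
    fields.concat(answers[2] ? ['citation_doi'] : %w[citation_author citation_year citation_title])
  end
  fields << 'site'         unless answers[3]
  fields << 'species'      unless answers[4]
  fields << 'treatment'    unless answers[5]
  fields << 'access_level' unless answers[6]
  fields << 'cultivar'     if answers[7] && !answers[8]
  fields << 'date'         unless answers[9]
  fields.concat(%w[n SE])  if answers[10]
  fields << 'notes'        if answers[11]
  fields
end

# Joins the fields into a CSV header line in template order.
def template_header(answers)
  build_field_list(answers).sort_by { |field| TEMPLATE_ORDER.index(field) }.join(',')
end
```

With every answer set to yes, `template_header` returns `yield,n,SE,notes`, since all other values are supplied once for the whole data set.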