Preparing Data For Submission


This section of the DDL guide explains best practices for organizing materials before submitting them to the DDL. It covers file naming conventions, removing Personally Identifiable Information (PII), writing data asset and dataset descriptions, and the documentation to include in an upload (e.g., codebook, informed consent form, and other attachments).
Reviewing for Quality
Before uploading datasets to the DDL, review them for quality. Are data types consistent throughout each data column? When data are ingested into the DDL, the platform identifies the data type of each column, such as numbers, text, and true/false (often coded as 0/1). It returns error messages when data are not consistent throughout a column, for example when a column mixes text and numeric values. It is best to read through your data, looking for inconsistencies in columns and other data quality issues, before submitting them to the DDL.
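As a quick pre-submission check, a short script can flag columns that mix text and numeric values before the platform's ingest validation does. The sketch below is illustrative only: it uses the pandas library and a hypothetical file name borrowed from the naming example later in this section; adapt the path and checks to your own data.

    import pandas as pd

    # Hypothetical file name; substitute the dataset you plan to submit.
    df = pd.read_csv("Feed_the_Future_Cambodia_Interim_Survey_2017_Household_Data.csv")

    for column in df.columns:
        values = df[column].dropna()
        # Attempt to interpret every non-missing value as a number.
        as_numeric = pd.to_numeric(values, errors="coerce")
        n_numeric = int(as_numeric.notna().sum())
        # A column that is partly numeric and partly text is likely to trigger
        # an error message when the DDL ingests the file.
        if 0 < n_numeric < len(values):
            print(f"Column '{column}' mixes numeric and text values "
                  f"({n_numeric} of {len(values)} non-missing values are numeric).")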
Codebook
For others to be able to read the data, the data asset must include a codebook that describes the data accurately and completely. A good codebook does the following:
  • Defines the column headings in the spreadsheet
  • Defines the allowable values for each column
  • Includes any further clarification about the data in each column
Best practice is to create the codebook as a .csv file where each row describes a different field in the dataset and each column gives information about the field. Following the naming convention discussed in the next section, the codebook might be called Feed_the_Future_Cambodia_Interim_Survey_2017_Codebook.csv.
As an example, consider a column in a spreadsheet that records the heights of children taken during a survey or measured during routine visits to a health facility. In this example, the Column Heading is the code used in the data file; the definition provides the age range of the children included in the column, the measurement rules, and the method for recording missing data; and the comment describes the method for measuring participant height by age.
  • Column Heading: HT
  • Definition: The column records the heights of children 0 to 59 months of age measured in centimeters to the nearest tenth of a centimeter. Cells with missing values (children not measured) are coded as 999.9.
  • Comment: Children under 24 months of age were measured lying down using a Quac Stick. Children 24 months of age and older were measured standing flush against the nearest wall.
Frequently, during the analysis of a dataset, the owner creates new variables. Continuing with the example based on the heights of children, the indicator of choice for a population is the proportion of children whose growth is stunted. Stunting for an individual child aged 0 to 59 months is determined by comparing the child’s height to a standard height for children of the same age, where the comparison is expressed as a Z-Score (deviation from the standard). The field containing the Z-Scores might be documented as follows in another row of the _Codebook.csv file for this dataset (a short sketch of writing such rows follows the example below).
  • Column Heading: HT-AGE
  • Definition: The difference between the child's height and the expected value for the reference population, expressed in standard deviation units.
  • Comment: The HT-AGE for an individual was calculated using the EPI-INFO program developed by the Centers for Disease Control using the WHO Child Growth Standard.
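As a minimal sketch, the two codebook entries above could be recorded as rows of the codebook .csv file using Python's standard csv module. The codebook's own column names (Column Heading, Definition, Comment) are an assumption consistent with the structure described above, not a DDL requirement.

    import csv

    # Each row of the codebook describes one field in the dataset.
    rows = [
        {
            "Column Heading": "HT",
            "Definition": "Heights of children 0 to 59 months of age, in centimeters "
                          "to the nearest tenth; missing values are coded as 999.9.",
            "Comment": "Children under 24 months were measured lying down; "
                       "children 24 months and older were measured standing.",
        },
        {
            "Column Heading": "HT-AGE",
            "Definition": "Difference between the child's height and the expected value "
                          "for the reference population, expressed in standard deviations.",
            "Comment": "Calculated with Epi Info using the WHO Child Growth Standard.",
        },
    ]

    with open("Feed_the_Future_Cambodia_Interim_Survey_2017_Codebook.csv",
              "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Column Heading", "Definition", "Comment"])
        writer.writeheader()
        writer.writerows(rows)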
Choosing Data Asset and Dataset Titles
Data asset and dataset titles are the first impression of the data for a visitor to the platform. They should be brief, precise, and informative. Many partners submit data to the DDL, which creates the possibility of duplicate file names when names are not specific (e.g., Final_Codebook.xlsx).
All file names associated with a data asset should begin with the same phrase referencing the data asset. Since entries are sorted alphabetically by default, this keeps related files together in the catalog and makes them easier to find with search. File names should also be unique to the data asset in question.
The best rule to follow is to begin with the name of the project, then add the relevant part of the program cycle, the year when the data were collected, and the object type.
In the example below, consider a data asset for a Feed the Future interim survey completed in 2017 in Cambodia with seven datasets, a codebook, a questionnaire, an informed consent form, a disclosure analysis plan, and a readme file. Names might be:
  • Data Asset: Feed_the_Future_Cambodia_Interim_Survey_2017
  • Dataset 1: Feed_the_Future_Cambodia_Interim_Survey_2017_Household_Data.csv
  • Dataset 2: Feed_the_Future_Cambodia_Interim_Survey_2017_Household_Members_Data.csv
  • Dataset 3: Feed_the_Future_Cambodia_Interim_Survey_2017_Mothers_Data.csv
  • Dataset 4: Feed_the_Future_Cambodia_Interim_Survey_2017_Children_Data.csv
  • Dataset 5: Feed_the_Future_Cambodia_Interim_Survey_2017_Womens_Empowerment_in_Agriculture_Index.csv
  • Dataset 6: Feed_the_Future_Cambodia_Interim_Survey_2017_Womens_Time_Allocation.csv
  • Dataset 7: Feed_the_Future_Cambodia_Interim_Survey_2017_Womens_Empowerment_in_Agriculture_Index_Recode.csv
  • Codebook: Feed_the_Future_Cambodia_Interim_Survey_2017_Codebook.csv
  • Disclosure Plan: Feed_the_Future_Cambodia_Interim_Survey_2017_Disclosure_Analysis_Plan.docx
  • ReadMe: Feed_the_Future_Cambodia_Interim_Survey_2017_ReadMe.txt
  • Questionnaire: Feed_the_Future_Cambodia_Interim_Survey_2017_Questionnaire.pdf
  • Informed Consent: Feed_the_Future_Cambodia_Interim_Survey_2017_Informed_Consent.pdf
In this example, all components of the entry are identified by the USAID program (Feed the Future), the country (Cambodia), the nature of the data collection effort (interim survey), the year, and a precise description of the component (Household_Data.csv, Codebook.csv, Children_Data.csv, etc.). Not all data resources stored in the DDL can be named in this way; however, identifying each part of a data asset while linking the parts together through a shared, common name is a best practice.
All file names should use an underscore (_) wherever a space would otherwise be needed. Do not include any spaces in the file name of any document. Blank spaces in a data asset or dataset name will disrupt the procedure by which the federal government harvests USAID’s metadata.
            Correct File Name: Feed_the_Future_Cambodia_Interim_Survey_2017
            Incorrect File Name: Feed the Future Cambodia Interim Survey 2017
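A small helper function like the sketch below can assemble file names from the project, program-cycle component, year, and object type while replacing any spaces with underscores. The function name and inputs are illustrative, not part of the DDL.

    def ddl_file_name(project, component, year, object_type, extension):
        """Build a DDL-style file name with underscores in place of spaces."""
        parts = [project, component, str(year), object_type]
        name = "_".join(part.strip().replace(" ", "_") for part in parts if part)
        return f"{name}.{extension}"

    # Reproduces the name of Dataset 1 in the example above.
    print(ddl_file_name("Feed the Future Cambodia", "Interim Survey", 2017,
                        "Household Data", "csv"))
    # Feed_the_Future_Cambodia_Interim_Survey_2017_Household_Data.csv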
Organizing Data for Submission
Consider ways to group datasets to form data assets before submitting data to the DDL. Be mindful of how a user unfamiliar with your data might best understand the structure and value of your data, and construct your data assets accordingly. For example, consider a program starting in a number of countries where each country implements a baseline survey. One alternative is to group all of the baseline surveys into a single data asset. A second alternative, recognizing that a second and even a third survey will be done, is to create a data asset for each country, which will ultimately contain not only the baseline survey but also the follow-on surveys.
Removing Personally Identifiable Information (PII)
Datasets should be scrubbed of personally identifiable information (PII) about participants before submission to the DDL.
Direct identifiers are data that can be used to identify a person without additional information or with cross-linking through other information that is in the public domain. Examples of direct identifiers include names, social security numbers, and email addresses. If not properly handled, they may seriously compromise the privacy, security, and confidentiality of individuals whose records appear in the dataset. USAID requires the removal of direct identifiers before data are submitted to the DDL.
Direct identifiers include:
  • Name
  • Address
  • Birthdate/birthday
  • Village
The HIPAA Privacy Rule includes 18 specific data elements in its definition of direct identifiers.
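As a minimal sketch, direct identifier columns can be dropped with pandas before submission; the column names below are hypothetical, so check your own codebook for the actual field names in your data. Dropping direct identifiers is only the first step; re-identification risk from indirect identifiers should still be addressed in the disclosure analysis plan.

    import pandas as pd

    # Hypothetical names for direct identifier columns; replace with the
    # field names actually used in your dataset.
    DIRECT_IDENTIFIERS = ["NAME", "ADDRESS", "BIRTHDATE", "VILLAGE"]

    df = pd.read_csv("Feed_the_Future_Cambodia_Interim_Survey_2017_Household_Data.csv")

    # Drop only the identifier columns that actually appear in this file.
    present = [col for col in DIRECT_IDENTIFIERS if col in df.columns]
    df = df.drop(columns=present)

    df.to_csv(
        "Feed_the_Future_Cambodia_Interim_Survey_2017_Household_Data_Deidentified.csv",
        index=False)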
For datasets on human subjects, all efforts to protect the privacy of those subjects should be thoroughly documented. Name this document using the naming rule above; in the earlier example, it would be Feed_the_Future_Cambodia_Interim_Survey_2017_Deidentification.txt.
Often, the complete description of the de-identification process should not be made public for fear that the process could be reverse engineered. Where there is a possibility of reverse engineering, a shorter version of the description should be prepared for public consumption, with both versions submitted to the DDL. Upload the public document on the Data Detail tab as [dataset title]_Deidentification_Public.txt as one of your Other Reference Materials when you submit your data.
Writing Dataset and Data Asset Descriptions
Before submitting to the DDL, draft descriptions of the data asset and its datasets, recognizing that the group of descriptions must work together to fully describe the data resource. Datasets may not always appear together as a group in the platform catalog; therefore, the uploader might choose to keep the description of the data asset as part of each dataset description, adding one or two sentences to uniquely describe the dataset. The following are sample descriptions of a data asset and one of its datasets. In these examples, the first sentence of the data asset description is unique, and the first and second sentences of the dataset description are unique; the rest of the description is identical between the data asset and the dataset.
  • Data Asset Description: This data asset contains the data from the interim population-based survey carried out in the Zone of Influence (ZOI) of the Feed the Future program in Cambodia in 2015. The ZOI comprises the Pursat, Battambang, Kampong Thom, and Siem Reap Provinces. The sampling design called for a two-stage cluster sample. In the first stage, 84 villages were selected; in the second stage, households were selected within each sampled village. The sampling of villages was stratified by province, with the number of villages in each stratum proportional to the population in the stratum and with villages selected with probability proportional to size, based on the 2013 Commune Database. The data are split into survey modules. Modules A through C include location information, informed consent, and the household roster. Module D includes household characteristics. Module E is the expenditures module, broken into 8 parts. Modules F and G include the hunger scale data and Women's Empowerment in Agriculture Index (WEAI) data. Modules H and I include mother and child dietary diversity data.
  • Description of the Household Dataset: This dataset captures data describing the members of each household from the first interim assessment of Feed the Future’s population-based indicators for the ZOI in Cambodia. It has 1019 rows and 295 columns. The ZOI comprises the Pursat, Battambang, Kampong Thom, and Siem Reap Provinces. The sampling design called for a two-stage cluster sample. In the first stage, 84 villages were selected; in the second stage, households were selected within each sampled village. The sampling of villages was stratified by province, with the number of villages in each stratum proportional to the population in the stratum and with villages selected with probability proportional to size, based on the 2013 Commune Database. The data are split into survey modules. Modules A through C include location information, informed consent, and the household roster. Module D includes household characteristics. Module E is the expenditures module, broken into 8 parts. Modules F and G include the hunger scale data and Women's Empowerment in Agriculture Index (WEAI) data. Modules H and I include mother and child dietary diversity data.
Documentation
Good documentation is essential to the long-term usefulness of data. Including a codebook, the informed consent form, a ReadMe file, the survey questionnaire, and other relevant documentation with your data submission makes the context of the data clear to users. Thorough documentation of the context of data collection facilitates proper interpretation and use.
  • Informed Consent: Researchers, evaluators, public opinion pollsters, and others are obliged to obtain informed consent from individuals who participate in their data collection efforts as informants. In order to implement its access plan for scientific research, USAID requires partners submitting data on human subjects to the DDL to also submit the form used to obtain that informed consent.
  • ReadMe: For complex data assets with multiple datasets, a useful attachment is a “ReadMe” file with a roadmap to the various datasets and any information that would facilitate proper use of the data by others. Such a file could include more detailed descriptions of each dataset, a guide to linking the datasets, and instructions regarding characteristics of the data that should be considered during future analysis (such as the weighting of cases in a sample).
  • Questionnaire: Data created through a survey process should include the questionnaire used in data collection as an attachment. The questionnaire is often included as a pdf attachment to the dataset.
  • Other Attachments: For data assets and datasets generated during a research project or an evaluation, the partner generating the data writes a report or other intellectual work based on the data prior to submitting to the DDL. All such reports should be included in the submission, either as an upload to the DDL or as a link to the report archived in some other repository such as USAID’s Development Experience Clearinghouse (DEC).
  • Statistical Software Files: Although the format of datasets submitted to the DDL is required to be non-proprietary (CSV file formats are preferred), partners submitting data assets to the DDL are also encouraged to submit versions of the data in formats produced by proprietary statistical software (see the sketch after this list). This best practice reduces work for new users of the data and minimizes the likelihood of errors being introduced when they recreate the statistical version. Statistical package versions of data should be identical in content to the original non-proprietary data. Any efforts to de-identify data prior to submission to the DDL must be applied to all formats of the data you submit.
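For example, a dataset that has already been cleaned and de-identified as a CSV can be exported to a statistical package format with pandas. The sketch below writes a Stata (.dta) copy with identical content; it assumes Stata is the package used in your analysis, and other packages (SPSS, SAS, R) have comparable export routines.

    import pandas as pd

    # Read the non-proprietary (CSV) version that will be submitted to the DDL.
    df = pd.read_csv("Feed_the_Future_Cambodia_Interim_Survey_2017_Household_Data.csv")

    # Write an identical copy in Stata format for submission alongside the CSV.
    df.to_stata("Feed_the_Future_Cambodia_Interim_Survey_2017_Household_Data.dta",
                write_index=False)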