BugFlow


a community resource for entomological collection digitization

Module 2B: Quality Control

Module Purpose:

This module offers step-wise data quality control and quality assurance strategies that can be implemented in various workflows. An eclectic approach is needed: some of these steps will apply to many digitization workflows and many organismal groups, while others will be irrelevant or highly specific to a given group. Examples of data that might be gathered here include the number of errors found per number of records checked, output from an automated script that checks image quality, or a list of outliers that need verification (for example, issues flagged by a georeferencing script). The quality assurance path presented here is (mostly) linear; that is, the QA steps are presented in the order in which they would naturally follow each other in a digitization protocol.
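
As one hedged illustration of the outlier list mentioned above, the following minimal Python sketch flags georeferenced records that lie far from the median point of a batch. The record fields (`id`, `lat`, `lon`), the function names, and the 500 km threshold are assumptions made for illustration and are not part of any BugFlow workflow. Taking the median of longitudes also assumes the batch does not straddle the antimeridian.

```python
import math
from statistics import median


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def flag_outliers(records, max_km=500.0):
    """Return the IDs of records farther than max_km from the batch's median point.

    max_km is a hypothetical threshold; choose one that suits the taxon or locality.
    """
    if not records:
        return []
    center_lat = median(r["lat"] for r in records)
    center_lon = median(r["lon"] for r in records)
    return [
        r["id"]
        for r in records
        if haversine_km(r["lat"], r["lon"], center_lat, center_lon) > max_km
    ]


# Hypothetical batch: the last record has a longitude with the wrong sign.
sample = [
    {"id": "r1", "lat": 35.00, "lon": -80.0},
    {"id": "r2", "lat": 35.10, "lon": -80.1},
    {"id": "r3", "lat": 34.90, "lon": -79.9},
    {"id": "r4", "lat": 35.05, "lon": -80.2},
    {"id": "r5", "lat": 35.00, "lon": 80.0},
]
print(flag_outliers(sample))  # ['r5']
```

Flagged records are candidates for manual verification rather than confirmed errors, since a genuine range extension would be flagged in the same way.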

Module Keywords:

quality assurance, quality control, QA / QC, script, data quality, error tracking, digitization rate, digitization cost

| TaskID | Task Name | Explanations and Comments | Resources |
| --- | --- | --- | --- |
| T1 | Expert or manager reviews and approves keyed data. | Since resources are seldom available for a manager to review all entered records, an informal or formal sampling scheme is usually required. Sampling schemes often take this form: initially review 100% of records entered by a new data capture technician, then reduce review to 50%. If quality remains high, reduce sampling frequency to as low as about 10%. If problems are found for a given technician, increase sampling frequency as necessary. Consider what software or system will be used to review these data (e.g., export to a spreadsheet or OpenRefine for review, or does the database software facilitate this task and include tracking metrics for the required or preferred DQ steps?). | |
| T2 | Use automated scripts to check against authority files for misspellings of taxon, geographical, or collector names, and for orthographic and data type inconsistencies. | | |
| T3 | Resolve problems identified in T2. | This may involve merging duplicated collecting events or localities, correcting or deleting incorrect data, etc. | |
| T4 | Confirm that records have been checked, when they were checked, and with what results. | | |
| T5 | Compare data with that of likely duplicates of the same occurrence or collecting event. | | |
| T6 | Cluster data and look for outliers. | | |
| T7 | Compare data with related data. | For example, compare taxon name and georeference with known or predicted geographic distribution or dates of collection. | |
| T8 | Expose data to community for expert feedback. | | e.g., through email, FilteredPush (http://wiki.filteredpush.org), institutional website, etc. |
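
Two of the tasks above lend themselves to short scripts. The Python sketch below is one possible reading of the T1 sampling scheme (step from 100% to 50% to about 10% review while quality stays high, and step back when problems appear), together with a T2-style check of names against an authority list. The 2% error cutoff, the 0.8 similarity cutoff, the function names, and the sample names are all assumptions for illustration; `difflib.get_close_matches` comes from the Python standard library.

```python
import difflib
import random

# Review percentages taken from the T1 description: 100%, 50%, about 10%.
REVIEW_TIERS_PERCENT = [100, 50, 10]


def next_tier(tier, error_rate, max_error_rate=0.02):
    """Return the tier index to use for a technician's next batch.

    tier indexes REVIEW_TIERS_PERCENT. max_error_rate is a hypothetical cutoff:
    at or below it, review is reduced one step; above it, review is increased.
    """
    if error_rate <= max_error_rate:
        return min(tier + 1, len(REVIEW_TIERS_PERCENT) - 1)
    return max(tier - 1, 0)


def pick_for_review(record_ids, tier, seed=None):
    """Randomly choose the share of a batch that the reviewer should check."""
    percent = REVIEW_TIERS_PERCENT[tier]
    count = (len(record_ids) * percent + 99) // 100  # integer ceiling
    return random.Random(seed).sample(record_ids, count)


def check_against_authority(values, authority, cutoff=0.8):
    """Map each value absent from the authority list to its closest entries.

    An empty suggestion list means no authority entry was similar enough.
    """
    known = set(authority)
    problems = {}
    for value in values:
        if value not in known:
            problems[value] = difflib.get_close_matches(
                value, authority, n=3, cutoff=cutoff
            )
    return problems


# Hypothetical authority file and keyed taxon names.
authority_names = ["Apis mellifera", "Bombus impatiens", "Danaus plexippus"]
keyed_names = ["Apis mellifera", "Bombus impatients", "Danaus plexipus"]
print(check_against_authority(keyed_names, authority_names))

# A technician at 50% review with a 1% error rate moves to the 10% tier.
tier = next_tier(1, error_rate=0.01)
print(REVIEW_TIERS_PERCENT[tier], len(pick_for_review(list(range(200)), tier, seed=1)))
```

The suggestions are only candidates for T3: a person still decides whether a flagged name is a misspelling or a valid name missing from the authority file.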

Essential Training:

Define metrics that can be measured to assess success of workflows using this module

Module Metrics, Costing, and Reporting:

Define metrics that can be measured to assess success of workflows using this module (reference specific TaskIDs).

Outreach Opportunities:

List outreach opportunities that arise in workflows using this module (reference specific TaskIDs).

Exemplar Workflows:

List of exemplar workflows organized by database type.


