
Guidelines for Address Matching and Geocoding

Purpose
Background
Address Standardization
Address Matching
Geocoding
The Importance of Geocoding
Geocoding Software
Street Centerline Data
Accuracy
Using Geocoded Data
Current Geocoding Process at DOH
Figure 1 Overview of Process
Benefits
Appendix A - Definition of Terms
Appendix B - Address Standardization
Appendix C - Local Government Data Used
Appendix D - Address Matching Accuracy
Appendix E - Return Code Samples
Appendix F - File Structures
Appendix G - Reference Data Samples
Revision History

Purpose

The Assessment Operation Group in the Washington State Department of Health coordinates the development of guidelines related to data development and use to promote good professional practice among staff involved in assessment activities within the Washington State Department of Health and in Local Health Jurisdictions in Washington. While this guideline is intended for an audience of differing levels of training related to data development and use, it assumes a basic knowledge of mapping concepts. It is not intended to recreate basic texts and other sources of information related to the topics covered, but rather to focus on issues commonly encountered in public health practice and, where applicable, on issues unique to Washington State.

Background

This guideline serves as a technical reference for geocoding street addresses, and provides background on the technologies used by the Washington State Department of Health (DOH). The DOH Division of Information Resource Management (DIRM) currently provides address standardization and geocoding services. These services are available to all divisions of DOH as well as other State agencies, Local Health Jurisdictions and other health related agencies. DIRM attempts to provide the highest quality and number of street level address matches. To this end, DOH has entered into data sharing agreements with many Washington State counties to share accurate street and parcel ownership data. Combining these data with commercially available data allows DIRM to maximize the number and quality of matches. Appendix A provides definitions of technical terms used in this guideline and acronyms not specified in the text.

Address Standardization

Address standardization takes a street address and ZIP code and attempts to correct misspellings and account for changes in ZIP codes. The address is parsed into standard pieces including the house number, street name, direction prefix, direction suffix, and street type. Once the address has been parsed, the values are standardized (e.g. AV becomes AVE, LP becomes LOOP). Currently DIRM uses the Centrus software from Pitney Bowes Inc. (http://www.centrus.com). The data used by Centrus are proprietary and come from both the U.S. Postal Service (USPS) and Geographic Data Technologies (GDT). The data are updated quarterly and the standardized addresses are certified by the USPS for bulk mailing rates. The Centrus software compares addresses to a USPS national database. This step is critical for increasing geocoding match rates. (See Appendix B for examples.)
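
The parsing and standardizing steps described above can be illustrated with a minimal Python sketch. It is not how Centrus works internally: the lookup tables, the function name parse_address and the output keys are all assumptions made for illustration, and a real standardizer relies on USPS reference data rather than a short hard-coded list.

```python
import re

# Hypothetical lookup tables (assumption); real software uses USPS reference data.
STREET_TYPES = {
    "AV": "AVE", "AVE": "AVE", "AVENUE": "AVE",
    "LP": "LOOP", "LOOP": "LOOP",
    "ST": "ST", "STREET": "ST",
    "RD": "RD", "ROAD": "RD",
    "DR": "DR", "DRIVE": "DR",
}
DIRECTIONS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}


def parse_address(address):
    """Split a street address into house number, prefix direction,
    street name, standardized street type, and suffix direction."""
    tokens = re.sub(r"[.,]", " ", address.upper()).split()
    parts = {"housenum": "", "predir": "", "street": "", "strsuf": "", "postdir": ""}
    if tokens and tokens[0].isdigit():
        parts["housenum"] = tokens.pop(0)
    # Keep at least one token for the street name itself.
    if len(tokens) > 1 and tokens[0] in DIRECTIONS:
        parts["predir"] = tokens.pop(0)
    if len(tokens) > 1 and tokens[-1] in DIRECTIONS:
        parts["postdir"] = tokens.pop()
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        parts["strsuf"] = STREET_TYPES[tokens.pop()]
    parts["street"] = " ".join(tokens)
    return parts


example = parse_address("400 Renton Av NE")
```

The sketch only normalizes abbreviations; it does not correct spelling or ZIP codes, which require comparison against a reference database.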

Address Matching

Address matching is the process of matching the street address and ZIP code in the original dataset to another address and ZIP code. Typically, the second address and ZIP code represent street centerlines or ownership parcels. The street centerlines can have address ranges and ZIP codes assigned to each side of the street. The ownership parcels have a single address and ZIP code assigned to a point.
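
As a rough illustration of matching against a street centerline, the Python sketch below tests whether a house number and ZIP code fall within the address range on either side of a segment. The StreetSegment record layout and the function names are hypothetical, and the odd/even parity test reflects a common addressing convention that is assumed here rather than taken from the DOH process.

```python
from dataclasses import dataclass


@dataclass
class StreetSegment:
    """Hypothetical street centerline record with an address range per side."""
    name: str
    left_from: int
    left_to: int
    left_zip: str
    right_from: int
    right_to: int
    right_zip: str


def in_range(number, low, high):
    """True if number lies within the range and shares its odd/even parity."""
    lo, hi = min(low, high), max(low, high)
    return lo <= number <= hi and number % 2 == low % 2


def match_side(segment, street, housenum, zip_code):
    """Return 'left', 'right', or None for one address against one segment."""
    if street.upper() != segment.name.upper():
        return None
    if zip_code == segment.left_zip and in_range(housenum, segment.left_from, segment.left_to):
        return "left"
    if zip_code == segment.right_zip and in_range(housenum, segment.right_from, segment.right_to):
        return "right"
    return None


# Illustrative segment; the ZIP code is invented for the example.
segment = StreetSegment("SHOSHONE DR", 3701, 3799, "98501", 3700, 3798, "98501")
side = match_side(segment, "Shoshone Dr", 3706, "98501")
```

A parcel match is simpler: the parcel carries a single address and ZIP code, so the comparison is a direct equality test rather than a range test.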

Geocoding

There are three main types of geocoding functions. The first type assigns latitude and longitude to a street address that has been matched to a street centerline or ownership parcel. These addresses can then be displayed as points on a map, or aggregated to larger areas (e.g. city limits, wellhead protection areas, school districts). For example, this type of geocoding can be used to show points on a map for all the addresses in the Washington State Cancer Registry. CAUTION: In general, the latitude and longitude at which a health event occurred are confidential information. Just as publishing someone’s address is most often a violation of confidentiality, data users need to be sensitive to the scale at which they display dots representing health events on maps. Before disseminating such maps they need to be sure that this method of visualizing data does not violate confidentiality.

The second type of geocoding is used for data without a street address. If the data in the original dataset have a geographic reference (e.g. ZIP code, county, U.S. Census tract), they can be geocoded to those geographic features. The data can be displayed as counts in graduated colors on a map. For example, survey results that contain only ZIP codes can be shown on a map by the number of results in each ZIP code.
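
The ZIP code example can be sketched in a few lines of Python. The survey records and field names below are invented for illustration; the resulting counts are what a mapping package would then shade in graduated colors.

```python
from collections import Counter

# Hypothetical survey records that carry only a ZIP code (illustrative data).
survey = [
    {"id": 1, "zip": "98501"},
    {"id": 2, "zip": "98502"},
    {"id": 3, "zip": "98501"},
    {"id": 4, "zip": ""},  # no geographic reference, so it cannot be geocoded
]


def counts_by_zip(records):
    """Count records per ZIP code, skipping records with no ZIP code."""
    return Counter(r["zip"] for r in records if r["zip"])


zip_counts = counts_by_zip(survey)
```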

The third type of geocoding is used for data without a street address or a specific geographic reference. This requires a common link between the data in a given data set and an existing geographic feature. For example, a data set that contains a hospital name and bed capacity can be shown at the hospital locations on a map. This is accomplished by linking the hospital name with previously geocoded hospitals that also contain the name.
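
A minimal Python sketch of this kind of linking is shown below, assuming the hospital name is the common key. The hospital names, coordinates and field names are made up for illustration, and the name normalization is deliberately simple; real data usually need more careful cleaning before names will match.

```python
# Hypothetical reference layer of previously geocoded hospitals
# (names and coordinates are invented for illustration).
geocoded_hospitals = {
    "EXAMPLE GENERAL HOSPITAL": (-122.90, 47.04),
    "SAMPLE VALLEY MEDICAL CENTER": (-117.91, 48.54),
}


def normalize(name):
    """Upper-case the name and collapse repeated whitespace."""
    return " ".join(name.upper().split())


def link_by_name(records, reference):
    """Attach coordinates to records whose hospital name is in the reference."""
    linked, unlinked = [], []
    for rec in records:
        key = normalize(rec["hospital"])
        if key in reference:
            lon, lat = reference[key]
            linked.append({**rec, "x_coord": lon, "y_coord": lat})
        else:
            unlinked.append(rec)
    return linked, unlinked


capacity = [
    {"hospital": "Example  General Hospital", "beds": 120},
    {"hospital": "Unknown Clinic", "beds": 10},
]
linked, unlinked = link_by_name(capacity, geocoded_hospitals)
```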

The Importance of Geocoding

Estimates that over 90% of corporate America’s data has some sort of geographic reference reinforce the need to spatially locate data. Geocoding allows DOH to display health-related information on maps, conduct geospatial analyses, and determine whether there are geographic patterns in rates of health-related events. For example, geocoding is needed to

  • conduct spatial analyses when investigating disease outbreaks and potential clusters;
  • assign health events recorded to ZIP code to the appropriate county;
  • aggregate address information to larger areas, such as ZIP codes, for displaying sensitive data while maintaining confidentiality.
A partial listing of geocoded data used at DOH includes: Births, Deaths, Cancer Registry, TB Cases, STD Cases, HIV Cases, WNV Tests, PHL Test Results, Diabetes Cases, Childhood Lead Exposures, Citizen Complaints, Fred Hutchinson CRC Data, Hospitals, Clinics, Pharmacies, Nursing Homes, Tobacco Retailers, WIC Clinics, WIC Clients, Hazardous Sites, Schools, Drug Labs, Radiation Equipment Licenses, Adult Family Homes, EMS Stations, Fire Stations, Ambulatory Surgery Centers, Kidney Dialysis Centers, Blood Centers, Boarding Homes, Daycares, Farmers Markets, Indian Health Services, Medical Testing Labs, Residential Treatment Centers, Farmworker Housing, Prisons and Mass Vaccination Sites.

Geocoding Software

In 2000, DIRM staff evaluated five software packages for accuracy and overall match rates: ArcView, Centrus, MapMarker, GeoVista and Maptitude. Ultimately, the quality of the underlying data seemed to make the most difference in the match rate. However, Centrus provided address standardization that improved match rates, and DIRM decided to use the combination of Centrus and ArcView GIS.

Street Centerline Data

DIRM staff evaluated a variety of street centerline data: U.S. Census TIGER 1992-2007, Environmental Systems Research Institute (ESRI) Streetmap, Geographic Data Technologies Dynamap 2000 and Navigation Technologies. These data sets were provided by the U.S. Census or purchased from commercial vendors. While no data set was complete, Navigation Technologies was the most accurate and complete for the entire state. Local level street data were also evaluated. The overall accuracy of street data obtained from counties and cities was higher than the other statewide data sets. Since no single data set is complete, multiple data sets are used. Appendix C shows the counties in Washington for which we have acquired local level digital data for streets or parcel centroids.

Accuracy

The process of address matching and geocoding involves many variables that affect the accuracy of the results. Below is a partial list of potential sources of inaccuracy.

  • The input address or ZIP code is incorrect.
  • The address standardization software incorrectly parses the address or ZIP code.
  • The street centerline attribute data may be incorrect for the address range, street name or ZIP code.
  • A street may be “flipped” so the address is placed on the wrong side or at the opposite end of the street. This can place a geocode in the adjacent U.S. Census tract or even county.
  • The various street and parcel data files do not exactly overlay with U.S. Census tracts. The boundaries of the tracts are based on TIGER streets. Latitude and longitude may be more positionally accurate than the TIGER data resulting in tract assignments that are incorrect.

Each successful geocode generates a match score (called “Av_score”) that reflects the accuracy of the match. Match scores range from 0 to 100. A score of 100 indicates that after the geocoding software parsed the address, a street or parcel was found where everything matched. A score of 0 indicates an approximate centroid match or an unmatched address. Appendix D contains some examples of address matches and the assigned scores.
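
As a rough guide to reading the score, the Python sketch below labels an Av_score using the ranges given in Appendix F (100 is “Close”, 40-99 is “Approx.”). The function name and the label for low scores are assumptions. Note that Appendix E lists some centroid matches with a score of 40, so the score alone may not separate street-level matches from centroids; the Certainty field is a useful cross-check.

```python
def describe_score(av_score):
    """Label an Av_score using the ranges listed in Appendix F."""
    if not 0 <= av_score <= 100:
        raise ValueError("Av_score must be between 0 and 100")
    if av_score == 100:
        return "Close"
    if av_score >= 40:
        return "Approximate"
    # Scores below 40 are treated here as centroid or unmatched (assumption).
    return "Centroid or unmatched"


labels = [describe_score(s) for s in (100, 88, 40, 0)]
```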

CAUTION:  It is essential to assess the proportion of geocoded records and the accuracy of the matches when interpreting rates or other statistics based on geocoded data. (See Using Geocoded Data.)

Using Geocoded Data

In order to use geocoded data, especially for relatively small geographies such as sub-county areas, the data analyst must be able to evaluate the accuracy of each geocode. At a minimum the “Av_score” field in the output file should always accompany the output data. For example, when there are no street centerline or parcel matches for the street addresses in the Washington State Cancer Registry, DIRM uses the 5-digit ZIP code or city name to assign addresses to the centroid of a ZIP code, city, or populated place. This process maximizes the number of records that can be assigned to a county and is useful for county level rates and reports. The user can use the “Av_score” to identify which records were geocoded using centroids and which were matched at the street level. These centroid geocodes may not be appropriate for small area analysis like cluster investigations or census tract level analysis.
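
One way an analyst might separate street-level records from centroid-level records before a small area analysis is sketched below in Python. The function name is hypothetical, and the default cutoff of 50 is an assumption taken from the lowest street centerline score listed in Appendix E; a stricter cutoff may be appropriate for a given study.

```python
def split_by_level(records, min_street_score=50):
    """Split records into street-level and centroid-level groups by Av_score
    and report the share of records matched at street level."""
    street_level = [r for r in records if r["Av_score"] >= min_street_score]
    centroid_level = [r for r in records if r["Av_score"] < min_street_score]
    share = len(street_level) / len(records) if records else 0.0
    return street_level, centroid_level, share


# Illustrative records; only the Av_score field is used here.
records = [
    {"id": 1, "Av_score": 100},
    {"id": 2, "Av_score": 88},
    {"id": 3, "Av_score": 40},
    {"id": 4, "Av_score": 0},
]
street, centroid, street_share = split_by_level(records)
```

Reporting the street-level share alongside any rate makes it easier to judge whether the geocoded subset is adequate for the geography being analyzed.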

Current Geocoding Process at DOH

DIRM uses the Centrus software to perform address standardization and ArcView software to perform the geocoding and the assignment of spatial attributes. This process is automated using the Avenue scripting language inside ArcView. This allows the use of multiple street and parcel datasets. The accuracy and source of the geocodes are also tracked. See Figure 1 for an overview of this process.

    Address Standardization

  1. Address data are provided to DIRM in a digital format (e.g. Access, ASCII, dBase).
  2. The addresses are standardized using the Centrus software to fix misspellings and ZIP code errors. Centrus also attempts to geocode the addresses; these geocodes are used as approximate matches (step 6 below).

    Address Matching

  3. Inside ArcView, the tolerances are set to accept only close matches (Av_score of 100).
  4. The standardized addresses are matched to street centerlines using the following data sets. Once a match is made the address is not used for the next data set.
    • TIGER 2007, U.S. Census Bureau
    • TIGER 2000, U.S. Census Bureau
    • Local Government streets or parcel databases.
    • TeleAtlas 2006
    • Streetmap 1000, Environmental Systems Research Institute (ESRI)
  5. Inside ArcView, the matching tolerances are set to accept “approximate” matches only (Av_score of 75).
  6. If Centrus geocoded any addresses that ArcView did not, they are included as approximate matches.

    Centroid Matching

  7. If requested, the unmatched addresses are matched to the following very approximate centroids. See Appendix E for examples.
    • NavTeq Street Centerlines
    • Centrus Zip Code matches with location codes of ZT* and ZB*
    • Zip Plus 4
    • 5 Digit Urban Zip Codes (smaller area than rural zip codes)
    • City Centers and Populated Places
    • 5 Digit Rural Zip Codes (removed ones crossing County lines)
    • Post Office Locations

    Geocoding

  8. Inside ArcView, the latitude and longitude are calculated for each matched address. This estimates the coordinates by averaging along a street segment and applying an offset of thirty feet from the centerline, or using the centroid’s latitude and longitude.

    Assigning Attributes

  9. Each matched address is assigned U.S. Census attributes and other geographic values. This is accomplished by comparing the latitude and longitude to other GIS spatial layers, using a point-in-polygon operation.
  10. Two output files in dBase format are created containing the matched addresses (with additional attributes) and the unmatched addresses. See Appendix F for the file structures.
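
The address matching steps above try one reference data set after another and remove an address from further processing once it has matched. A minimal Python sketch of that cascade follows. It is not the Avenue script used at DOH: the function name is hypothetical, each reference data set is reduced to a dictionary keyed on the exact standardized address (a simplification), and the coordinates are invented.

```python
def cascade_match(addresses, reference_sets):
    """Match addresses against an ordered list of (source_name, lookup) pairs.
    An address matched by an earlier source is not tried against later ones."""
    matched = {}
    remaining = list(addresses)
    for source, lookup in reference_sets:
        still_unmatched = []
        for addr in remaining:
            if addr in lookup:
                matched[addr] = {"source": source, "location": lookup[addr]}
            else:
                still_unmatched.append(addr)
        remaining = still_unmatched
    return matched, remaining


# Illustrative reference data; coordinates are made up.
tiger = {"131 ELM ST E": (-122.90, 47.00)}
local = {"131 ELM ST E": (-122.91, 47.01), "200 CONGER ST NW": (-122.92, 47.05)}

matched, unmatched = cascade_match(
    ["131 ELM ST E", "200 CONGER ST NW", "1 NOWHERE RD"],
    [("TIGER 2007", tiger), ("Local Government", local)],
)
```

Recording the source with each match, as the sketch does, is what allows the output file to report which reference data set produced each geocode.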

Figure 1 Overview of Process

Process Overview

Benefits

  • Using an iterative approach on multiple data sets maximizes the number of matched records. (See Appendix G for examples.)
  • This approach provides the ability to customize the assignment of spatial attributes.
  • This approach uses existing GIS software maintained and supported by DOH.
  • The ESRI shapefile of points representing the matched addresses can be viewed with many GIS software packages.
  • The ArcView portion of this process is automated using the Avenue scripting language.
  • This approach provides the ability to add additional street data sets as they become available.
  • The output file structure is provided in a standardized format and includes fields to identify the accuracy and source of the matches.

Appendix A - Definition of Terms

Approximate match is meant to represent acceptable address matches. This level of matching allows for slightly misspelled street names or missing street types or directional information.

Assign spatial attributes involves first geocoding an address then comparing its location to another GIS spatial layer. These layers most often contain polygon or area features (e.g. census block groups, city limits).
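
The comparison of a geocoded location to a polygon layer is a point-in-polygon test. The Python sketch below uses the standard ray-casting approach on simple polygons given as lists of (x, y) vertices. It is an illustration only: the function names and the sample layer are hypothetical, and GIS software handles details such as holes, multipart features and points that fall exactly on a boundary.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a horizontal ray
    from the point crosses; an odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_area(x, y, areas):
    """Return the name of the first area containing the point, or None."""
    for name, polygon in areas.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Hypothetical layer with two square areas.
areas = {
    "TRACT A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "TRACT B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
area_name = find_area(5, 5, areas)
```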

Attributes are information related to a map feature (e.g. census demographics pertaining to census tract).

Centroids are point locations representing areas, buildings or an approximate location.

Close match is intended to represent addresses that match a given street segment, using the street name, house number, and ZIP code information. The geocoding process automatically parses the input address and attempts some limited standardization before the matching is attempted. These matches are the most accurate possible.

Street segment is a portion of a street centerline in a linear GIS spatial layer. Streets are often divided up into these segments to incorporate changing address ranges, ZIP codes or other attribute changes.
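
A geocode along a street segment is typically estimated by placing the house number proportionally within the segment's address range and then moving the point off the centerline. The Python sketch below illustrates this with a thirty-foot offset. It assumes a straight segment and planar coordinates in feet, not latitude and longitude, and the function name and parameters are hypothetical.

```python
import math


def interpolate_point(start, end, from_num, to_num, housenum, side, offset=30.0):
    """Estimate a location along a straight segment from start to end,
    offset perpendicular to the centerline on the given side
    ('left' or 'right' when travelling from start to end)."""
    if to_num == from_num:
        fraction = 0.5
    else:
        fraction = (housenum - from_num) / (to_num - from_num)
    fraction = min(max(fraction, 0.0), 1.0)  # keep the point on the segment
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("segment has zero length")
    x = start[0] + fraction * dx
    y = start[1] + fraction * dy
    # (-dy, dx) / length is the unit normal pointing left of the travel direction.
    sign = 1.0 if side == "left" else -1.0
    return (x + sign * offset * -dy / length, y + sign * offset * dx / length)


left_point = interpolate_point((0.0, 0.0), (1000.0, 0.0), 100, 200, 150, "left")
```

If a segment is digitized in the opposite direction from its address range, the same calculation places the point at the wrong end or on the wrong side of the street, which is the “flipped” street problem noted under Accuracy.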

TIGER (Topologically Integrated Geographic Encoding and Referencing system) refers to the system and data format the U.S. Census Bureau uses to display geography.

Appendix B - Address Standardization

These are some examples of the address standardizing Centrus provides.
| Input | Output | Correction |
| --- | --- | --- |
| 131 Elm, 98501 | 131 Elm ST E, 98501 | Adds street type and direction |
| 200 Conger Ave, 98502 | 200 Conger ST NW 98502 | Changes street type and adds a direction |
| 601 Ryan RD, 98502 | 601 Ryan RD, 98512 | Updates the ZIP code if necessary |
| 400 Renton Ave NE, Renton, WA 98356 | 400 Renton Ave NE, New Castle, WA 98356 | Updates the city name |
| 333 Is Reel Rd SE, 98501 | 333 Israel Rd SE, 98501 | Corrects the street spelling |

Appendix C - Local Government Data Used

This map highlights counties from which DOH has acquired accurate GIS addressing data. These data are in the form of street centerline files with address ranges, or parcel ownership points that contain the site address. The map also shows match rates by county for the 2008 Cancer Registry.

County Map

Appendix D - Address Matching Accuracy

| Input Address | Street Segment Attributes in ArcView | Av_Score |
| --- | --- | --- |
| Typical Close Matches | | |
| 1490 LK DR. | 1466-1577 Lake Dr | 100 |
| 3706 Shoshone Dr | 3700-3798 Shoshone Dr | 100 |
| 1301 N Highlands Pkwy | 1301-1399 N Highlands Pky | 100 |
| 1301 Highlands Pkway | 1301-1399 N Highlands Pky | 100 |
| 3017 Lombard Ave Apt 809 | 3001-3099 Lombard Ave | 100 |
| Typical Approximate Matches | | |
| 9531 Forest Del Dr | 9400-9600 Forest Dell Dr | 90 |
| 1690 80th Street KP | 1660-1700 80th K P St S | 88 |
| 821 Port Susan Terrace Rd | 801-849 Port Susan Ter Rd | 83 |
| 12329 55th Pl W | 12101-12399 5th Pl W | 82 |
| 1521 Hwy 101 W Sp#29 | 1507-1531 USHY 101 | 80 |
| 5720 Blvd Ext Rd Se | 5312-5898 Boulevard Rd Se | 79 |
| 4450 Abelin Ct S #81 | 4400-4448 Abelia Ct S | 78 |
| 124 Sussex St, 98589 | 98589Sussex | 40 |

Appendix E - Return Code Samples

| Match Type | Accuracy | Source | Score | Certainty | Geolevel | Quality | N_lcode |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Parcel Centroid | Close, Approximate | Local Parcels | 75 - 100 | 1 | CENSUSTRACT | 02 | AP* |
| Street Centerline | Close, Approximate | Centrus, TIGER 2000, GDT 1000, NAVTEQ, Local Roads | 50 - 100 | 1 | CENSUSTRACT | 03 | AS* |
| Street Centroid | Very Approximate | NavTeq Streets | 40 | 3 | ZIPCODE | 05 | ZT7F |
| Centrus Zip Code Centroid | Very Approximate | Centrus | 0 | 4 | ZIPCODE | 09 | ZB*, ZT* |
| Zip Plus 4 Centroid | Very Approximate | Zip4 Centroids | 40 | 2 | ZIPCODE | 06 | ZC9Z |
| 5 Digit Zip Code Centroid | Very Approximate | 5 Digit Rural Zip Centroids, 5 Digit Urban Zip Centroids | 0 | 4 | ZIPCODE | 09 | ZC5X |
| City/Place Centroid | Very Approximate | City or Place Centroids | 0 | 5 | CITY | 11 | ZC5X |
| Post Office Centroid | Very Approximate | Post Office Zip Centroids | 0 | 5 | ZIPCODE | 10 | ZC5Z |
| Unmatched | | | 0 | 9 | | 99 | |

Centrus field “N_lcode” value descriptions are available upon request.

Appendix F - File Structures

| Field Name | Type | Width | Decimals | Example | Description |
| --- | --- | --- | --- | --- | --- |
| Input | | | | | |
| Address | Char | 40 | | 1060 S MAIN #47 | Input street or mailing address (required) |
| Zip | Char | 5 | | 99114 | Input 5 digit ZIP code (required) |
| Zip4 | Char | 4 | | 9656 | Input ZIP Plus 4 |
| City | Char | 20 | | COLVILLE | Input city name |
| State | Char | 2 | | WA | Input State |
| Output (Centrus standardization) | | | | | |
| N_address | Char | 40 | | 1060 S MAIN ST TRLR 47 | Standardized address |
| N_city | Char | 30 | | COLVILLE | Standardized city name |
| N_zip | Char | 5 | | 99114 | Standardized ZIP code |
| N_zip4 | Char | 4 | | 9656 | Standardized ZIP Plus 4 |
| N_housenum | Char | 6 | | 1060 | House number |
| N_street | Char | 30 | | MAIN | Standardized street name |
| N_strsuf | Char | 6 | | ST | Standardized street name suffix |
| N_predir | Char | 6 | | S | Standardized street name prefix direction |
| N_postdir | Char | 6 | | E | Standardized street name suffix direction |
| N_unit | Char | 6 | | 47 | Standardized unit number |
| N_State | Char | 2 | | WA | Standardized State abbreviation |
| N_mcode | Char | 4 | | S80 | Centrus code describing the standardization |
| N_lcode | Char | 4 | | ZC5X | Centrus code describing the geocoding level |
| Output (Matched records only) | | | | | |
| *Accuracy | Char | 20 | | Close | Type of match, “Close” or “Approximate” |
| *Source | Char | 20 | | TIGER 2000 | Name of the reference data set used to geocode |
| Av_date | Char | 30 | | Tues Jan21 15:44:00 2003 | Date geocoded |
| *Certainty | Num | 1 | 0 | 4 | NAACCR Certainty Code |
| Geolevel | Char | 30 | | ZIPCODE | CDC, Accuracy of Tract assignment |
| Quality | Char | 2 | | 06 | NAACCR Coordinate Quality Code |
| Av_score | Num | 3 | 0 | 100 | Match score (100=“Close”, 40-99=“Approx.”) |
| Av_city | Char | 20 | | Colville | City name if inside city limits. |
| Zcta | Char | 5 | | 98502 | Census 2000 ZIP Code Tabulation Area |
| Av_zip | Char | 5 | | 99114 | Geocoded ZIP code (may not match input ZIP) |
| X_coord | Num | 15 | 5 | -117.9051 | Longitude of address geocode |
| Y_coord | Num | 15 | 5 | 48.53513 | Latitude of address geocode |
| Av_co | Char | 3 | | 065 | Census 1990 County FIPS Code (001-077) |
| Av_alpha | Char | 2 | | 33 | Alphabetical County ID (01-39) |
| Tract90 | Char | 6 | | 950500 | Census 1990 tract number |
| Tract90d | Char | 8 | | 9505.003 | Census 1990 Decimal format tract/block group |
| Bg90 | Num | 1 | 0 | 3 | Census 1990 block group number |
| Block90 | Char | 4 | | 13B | Census 1990 block number |
| Tract00 | Char | 6 | | 950500 | Census 2000 tract number |
| Tract00d | Char | 8 | | 9505.003 | Census 2000 Decimal format tract/block group |
| Bg00 | Num | 1 | 0 | 3 | Census 2000 block group number |
| Block00 | Char | 4 | | 3745 | Census 2000 block number |
| RUCA | Num | 3 | 1 | 3.6 | Rural and Urban Commuting Code version 2 |
| SDUNI | Char | 5 | | 07145 | School District Unique ID Code |

* These three fields are mandatory for distributing geocoded data. This will allow users to evaluate the accuracy of the CENSUS attributes and Latitude/Longitude values. For example: A query for “Certainty = ‘1’ or Certainty = ‘3’” will select only street level geocodes and exclude inaccurate centroid geocodes.
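
The example query can be expressed in Python once the output file has been read into a list of records. The sketch below assumes each record is a dictionary keyed by the field names in this appendix; how the dBase file is read is left out, and the function name and sample rows are hypothetical.

```python
def select_by_certainty(records, certainty_codes=(1, 3)):
    """Keep only records whose Certainty code is in certainty_codes.
    int() lets the filter accept the code stored as a number or as text."""
    return [r for r in records if int(r["Certainty"]) in certainty_codes]


# Illustrative output rows; only the fields used here are shown.
rows = [
    {"N_address": "1060 S MAIN ST TRLR 47", "Certainty": 1, "Source": "TIGER 2000"},
    {"N_address": "124 SUSSEX ST", "Certainty": "3", "Source": "NavTeq Streets"},
    {"N_address": "PO BOX 12", "Certainty": 4, "Source": "Centrus"},
    {"N_address": "1 NOWHERE RD", "Certainty": 9, "Source": ""},
]
selected = select_by_certainty(rows)
```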

Appendix G - Reference Data Samples

Street reference data with address ranges on each side, from TIGER 2006, TeleAtlas, GDT and ESRI.
Street Centerlines

Parcel centroids acquired from County Assessors, one street address for each point.
Parcel Centroids

Zip Plus 4 centroids located in the middle of the streets.
Zip Plus 4 Centroids

City Centers and Populated Places Centroids
City/Place Centroids

5 Digit Zip Code Centroids
5 Digit Zip Centroids

Post Office Locations
Post Office Centroids

Revision History

July 18, 2007
  1. Removed out of State records from processing if State field is provided as input.
  2. Removed the TIGER 1992 – 1998 street reference datasets.
  3. Removed the process step of geocoding to the original addresses first.
  4. Updated the zip code data to 2006.
  5. Updated the County parcel centroids.
  6. Added centroids for: Centrus 5 digit zip codes, Zip+4, urban zip codes, rural zip codes, city, places and post offices.
  7. Added fields to the output file structure for: standardized State, certainty, geolevel and quality.
August 7, 2007
  1. Added NavTeq street centroids (within each zipcode) as a matching option before centroids.
  2. Changed the ArcView Spelling and Sensitivity score for approximate matches to 75 from 80.
  3. Added a highways reference layer to get more State Route and US Highway geocodes.
January 23, 2008
  1. Modified the matching order to use TIGER datasets first. Any street or parcel match made without TIGER as a source contains potential errors in the CENSUS attribute assignments. For centroid-level matches, the errors are large enough that the CENSUS attribute assignments should be considered unusable. The non-TIGER street or parcel centroid geocodes could have a 1% adjacent tract error and a 20% adjacent block error.
Known Issues
  1. In some cases PO BOX addresses are incorrectly matched to the Zip Plus 4 centroids.
  2. Zip codes that cross county boundaries have been removed from all the centroid data layers except Centrus.
  3. Any match with a score of less than 70 should be reviewed.
Planned Enhancements
  1. Modify the city and zip code centroids to identify those that are likely within a single CENSUS tract.
  2. Upgrade the batch process to ArcMAP 9.2.
  3. Build a query to remove Centrus centroid matches to 5 digit zip codes that cross counties.
  4. Evaluate other software for geocoding.

Contact: Craig Erickson
Craig.Erickson@doh.wa.gov
(360)236-4271




Washington State Department of Health
101 Israel Rd SE, P.O. Box 47904
Olympia, Washington, 98504-7904

Last Update : 12/19/2008 8:23 AM