Friday, November 7, 2014

Data Normalization, Geocoding, and Error Assessment

- Introduction -

A table of sand mines in western Wisconsin was provided by the Wisconsin DNR, and each student was given the task of geocoding 15 to 20 mines. Each mine was geocoded by multiple students to showcase the user error involved in the geocoding process. The table provided by the WI DNR was not completely normalized and, for some mines, contained only Public Land Survey System (PLSS) addresses. This complicated the geocoding process and introduced error. The process and the errors associated with geocoding are discussed in this post.

- Methods -

My assigned mines were located in the table provided by the WI DNR and copied into a separate Excel spreadsheet to eliminate any formatting that might cause errors later in ArcMap. The address field in the table was divided into its constituent parts, or normalized, so the address could be understood by the World Geocode Service address locator provided by ArcGIS Online. The normalized table was used to geocode the mine addresses in ArcMap, with the appropriate fields used as Address Input Fields. Once the addresses were geocoded, the match addresses and spatial locations were inspected. The PLSS addresses could not be automatically geocoded, so they were placed manually using a PLSS feature class and a satellite imagery base map. The 'Pick Address from Map' option was used for many of the addresses to locate the address closest to the mine entrance from a road.
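To illustrate what normalizing an address field means, here is a minimal Python sketch. It is not the workflow I used (that was done by hand in Excel), and the function name, field names, and lookup lists are all hypothetical. It assumes the house number comes first, followed by an optional direction prefix, with the street type last.

```python
# Hypothetical lookup sets; a real project would need fuller lists.
DIRECTIONS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
STREET_TYPES = {"ST", "RD", "AVE", "DR", "LN", "HWY", "CT", "BLVD"}


def normalize_address(raw):
    """Split a single-field street address into separate fields."""
    # Drop commas and periods, uppercase, and split on whitespace.
    tokens = raw.replace(",", " ").replace(".", " ").upper().split()
    fields = {
        "house_number": "",
        "prefix_dir": "",
        "street_name": "",
        "street_type": "",
    }
    # Treat the first token as the house number if it contains a digit.
    if tokens and any(ch.isdigit() for ch in tokens[0]):
        fields["house_number"] = tokens.pop(0)
    # An optional direction prefix comes next.
    if tokens and tokens[0] in DIRECTIONS:
        fields["prefix_dir"] = tokens.pop(0)
    # The street type, if recognized, is the last token.
    if tokens and tokens[-1] in STREET_TYPES:
        fields["street_type"] = tokens.pop()
    # Whatever remains is taken as the street name.
    fields["street_name"] = " ".join(tokens)
    return fields


print(normalize_address("1234 N. Main St"))
```

Each key would become its own column in the spreadsheet. Anything the sketch does not recognize falls into the street name, so the output would still need the same visual inspection described above.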

Once the mines were geocoded and inspected, each student imported a shapefile of their geocoded mines into a common geodatabase. All the shapefiles were merged, and the mines that corresponded to the ones I geocoded were queried out. The distance between my mines and the same mines geocoded by my classmates was calculated with the 'Point Distance' tool.
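For readers without ArcMap, the idea behind the distance comparison can be sketched in Python. This is not how the 'Point Distance' tool works internally; it is a great-circle (haversine) approximation that assumes the points are stored as latitude and longitude in decimal degrees. The function name and the sample coordinates are made up for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical coordinates for the same mine placed by two students.
mine_a = (44.80, -91.50)
mine_b = (44.81, -91.48)
print(round(haversine_miles(*mine_a, *mine_b), 2), "miles apart")
```

A distance near zero means two students picked essentially the same location for a mine. Larger values show how far apart their placements were.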

- Results -

Table 1: Sample of the normalization performed on the mine addresses.
The red color in the normalized table indicates a PLSS address.

Table 2: The error in distance (miles) between the same mines
geocoded by my classmates and me.

Map 1: A map showcasing the geocoding result.


- Discussion -

Of the 22 mines I was assigned, every one geocoded automatically based on the address given in the normalized table. However, only 10 of those matches were actually correct. PLSS addresses do not follow a format that the address locator could understand and were therefore matched based on town or county. These mines had to be found manually by querying a PLSS polygon feature class of WI and then examining a satellite imagery base map for the closest address location. All addresses that were not PLSS contained the street address information in one field of the table. Some address locators can parse this information, but it is always better to be thorough and normalize the address field into its constituent parts. Parts of an address that need separate fields include house number, direction prefix, street name, street type, sub-address, city, state, and zip (seen in Table 1). If this information is in a single field, an address locator may fail to match the address or match it erroneously. Some addresses were incomplete or misspelled, which are examples of attribute data input error.
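Because the PLSS records still "matched" (just to the wrong place), it helps to flag them before geocoding so they can be set aside for manual placement. The sketch below is one hedged way to do that in Python. I am assuming the PLSS descriptions contain township and range tokens written like "T25N" and "R9W"; the actual formatting in the DNR table may differ, and the function name and sample strings are hypothetical.

```python
import re

# Assumed token shapes: township like "T25N", range like "R9W".
TOWNSHIP = re.compile(r"\bT\.?\s*\d{1,3}\s*[NS]\b", re.IGNORECASE)
RANGE = re.compile(r"\bR\.?\s*\d{1,3}\s*[EW]\b", re.IGNORECASE)


def looks_like_plss(text):
    """Return True if the text contains both a township and a range token."""
    return bool(TOWNSHIP.search(text)) and bool(RANGE.search(text))


# Hypothetical sample records.
records = ["NE 1/4 Sec 14, T25N, R9W", "1234 N Main St"]
for record in records:
    label = "PLSS, place manually" if looks_like_plss(record) else "street address"
    print(record, "->", label)
```

Records flagged this way would skip the address locator and go straight to the PLSS feature class and imagery check described above.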

When geocoding the mines, I either accepted the automatic match or picked an address from the map that corresponded best with the given information and my interpretation of a satellite imagery base map. Some mines geocoded by my classmates were not given a specific address, and the mine was placed in the middle of a city or county (like the PLSS addresses). This happens when the address is not available or not understandable by the address locator, so the match is based on whatever the locator can find (city, county, or state). One address was even located in Missouri. Each student was given the same table to normalize and geocode, yet when comparing my mines to my classmates', only 8 out of 55 addresses were the same (Table 2; 0.01 inch represents the same address). This demonstrates the amount of subjectivity involved in geocoding unstandardized data.

- Conclusion -

The process of geocoding can be easy or difficult depending on the quality of the data used. When data automation and compilation are unstandardized, differences and errors in attribute entry can arise. This introduces error into the geocoding process as well as a level of subjectivity on the analyst's part, potentially introducing more error. Normalization and a thorough assessment of data entry quality are vital when geocoding addresses, especially for data that are unstandardized.

- Sources -

Wisconsin Department of Natural Resources (DNR). Mine table.

ESRI geodatabase (2013), USA Census Data. Accessed through UWEC Department of Geography and Anthropology.

