Recall that given the sparsity of the semantic information encoded in OSM building data, our objective is to predict the type of buildings based on data available in OSM. Below we describe (1) ground truth and OSM data acquisition for each of the study regions, (2) data processing, (3) feature extraction from OSM building footprints, and (4) building classification. For reproducibility, all our code is available at https://github.com/heykuldip/osm_buildings_classification and a repository of the data used is at https://osf.io/3j46v/.
We selected three study areas for which we were able to obtain ground truth data: Fairfax County, Mecklenburg County, and the City of Boulder. We included the City of Boulder to examine model differences in a city versus suburban setting (see Table 2). We note that, while extremely useful, official data mapping or defining building types are not publicly available for the vast majority of counties in the US or for regions elsewhere.
We used PyOsmium13 to extract building polygons from the way and relation elements of OSM. We extracted the building footprints based on whether the ‘building’ tag of an OSM polygon has any value. This step is necessary because OSM includes many spatial objects that are not buildings, such as bodies of water, trees, roads, and intersections.
We downloaded the official building footprint data with associated building types from each administrative unit’s spatial data portal. Ground truth data for Fairfax County was obtained from14; ground truth data for Mecklenburg County was obtained from15; and ground truth data for City of Boulder was obtained from16.
Since our goal is to predict residential and non-residential building types, we first map a large number of heterogeneous building types in both OSM and the ground truth data (e.g. apartments, church, office) to these two classes. For OSM data, we aggregate building types based on the building tag values to create three meta-categories—residential, non-residential, and unknown (Table 3). We note that in OSM the unknown category is by far the most common, composed mostly of buildings with the tag value ‘yes’. This categorization was used to compare the OSM raw data to the ground truth data to produce Fig. 1.
For the ground truth datasets for Fairfax County, Mecklenburg County, and City of Boulder, which we use to train and validate our models, we aggregate building types based on building tags to create two meta-categories—residential and non-residential (Table 4). We exclude buildings for which no clear building type is provided or which are not clearly buildings, so as to not compromise our ground truth data (i.e. buildings labeled as building types ‘Mobile Home’, ‘Agricultural’, ‘Foundation/Ruin’, and ‘Misc’). This way, we excluded 2.24%, 0.35%, and 32.33% of total buildings in the Fairfax, Mecklenburg, and Boulder official datasets, respectively.
To find the corresponding buildings in OSM and in the ground truth datasets, we perform a spatial join on the building polygons across the two datasets. Specifically, every building in OSM is mapped to the building in the ground truth data with which it has the largest spatial intersection. Buildings in OSM that do not intersect any building in the ground truth data are removed from our study. For example, Fairfax County has 269,366 official buildings. A join between the official data and the OSM building footprint data results in 197,215 official buildings and 204,672 OSM buildings. The difference arises because, in some cases, several smaller buildings in OSM are contained within one official building. For each of the matched buildings, we now have both a rich source of data from OSM and the ground truth building type obtained from the official sources. In the data pre-processing step, we used the Geopandas17 library for geospatial operations on our input data.
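The largest-intersection matching described above can be sketched with Geopandas as follows. This is a minimal illustration under stated assumptions, not the code from our repository: the function name `match_largest_intersection`, the column names `osm_idx`, `truth_idx`, and `building_type`, and the toy rectangles are all hypothetical, and both layers are assumed to share a projected coordinate system so that areas are comparable.

```python
import geopandas as gpd
from shapely.geometry import box


def match_largest_intersection(osm, truth):
    """Map each OSM building to the ground truth building it overlaps most.

    OSM buildings that intersect no ground truth building are dropped.
    Column names are hypothetical and chosen for this sketch only.
    """
    osm = osm.reset_index(drop=True)
    truth = truth.reset_index(drop=True)
    osm = osm.assign(osm_idx=range(len(osm)))
    truth = truth.assign(truth_idx=range(len(truth)))

    # One row per (OSM building, ground truth building) overlap.
    pieces = gpd.overlay(osm, truth, how="intersection", keep_geom_type=True)
    pieces["overlap_area"] = pieces.geometry.area
    pieces = pieces[pieces["overlap_area"] > 0]

    # Keep, for every OSM building, the overlap with the largest area.
    best = pieces.loc[pieces.groupby("osm_idx")["overlap_area"].idxmax()]
    columns = ["osm_idx", "truth_idx", "building_type", "overlap_area"]
    return best[columns].reset_index(drop=True)


# Toy data (made up): two official buildings and four OSM footprints.
truth = gpd.GeoDataFrame(
    {"building_type": ["residential", "non-residential"]},
    geometry=[box(0, 0, 10, 10), box(10, 0, 20, 10)],
)
osm = gpd.GeoDataFrame(
    geometry=[
        box(1, 1, 4, 4),      # inside the first official building
        box(5, 5, 8, 8),      # also inside the first official building
        box(8, 2, 16, 6),     # overlaps both, mostly the second
        box(30, 30, 31, 31),  # intersects nothing and is removed
    ],
)
matches = match_largest_intersection(osm, truth)
```

In this toy case two OSM footprints map to the same official building, which mirrors the reason the matched OSM count can exceed the matched official count.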
Deriving features for classification
Geometric properties of building footprints and their spatial relationship to other features can be used to predict building type18,19,20. Therefore, we enhance the sparse building attributes found in OSM data by deriving several new geometric attributes based on the shape and location of the building footprints. Below we describe the features, including proximity to roads, proximity to parking lots, building footprint area, intersection with land use, and existing tags, and how these features are obtained from OSM.
Proximity to roads
The road network is one of the most exhaustive features in OSM and has been used effectively to identify residential buildings21. We use a similar technique and extend it to predict both the residential and the non-residential class. While many buildings in OSM do not have an explicit building type tag, all road segments in OSM have tags (stored in the ‘highway’ tag of a road segment) indicating the specific road class (e.g., ‘residential’, ‘motorway’, or ‘service’). We hypothesize that this information is a useful predictor of the type of nearby buildings.
For this purpose, we enrich each building in OSM with multiple dichotomous indicator variables that indicate whether or not the building falls in range of each of four road meta-categories: (1) residential roads, (2) highways, (3) motorways, and (4) service roads. The OSM ‘highway’ tag classifies roads by type and capacity, ranging from pathways to expressways. The road type tags in OSM map to our meta-categories as follows: (1) Residential roads: tag values ‘residential’ and ‘living_street’; (2) Highways: tag values ‘primary’, ‘secondary’, and ‘tertiary’; (3) Motorways: tag values ‘motorway’ and ‘trunk’; (4) Service roads: tag value ‘service’.
For each meta-category of roads, we add three indicator attributes to each building, where a value of 1 indicates that the building is located within a 0–30 m, 0–60 m, or 0–90 m range of the road network, respectively, and a value of 0 indicates that it is not. This yields a total of twelve indicator attributes for each building, where indicators 1 to 3 correspond to residential roads, 4 to 6 to highways, 7 to 9 to motorways, and 10 to 12 to service roads. For example, the indicator values [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1] indicate that a building falls within a 0–30 m radius of a residential road (first indicator variable) and thus also within a 0–60 m and 0–90 m radius (second and third indicator variables). The building is not in range of any highways or motorways. However, it falls within 0–90 m of a service road (indicator variable twelve), but not within 0–30 m or 0–60 m.
To efficiently compute the indicator variables for each building, we create corresponding buffers (of 0–30 m, 0–60 m, and 0–90 m) around each road in OSM. Then, we perform a spatial join between these buffers and the polygons of the buildings in OSM. For each intersection, depending on the ‘highway’ tag of the road, the corresponding indicator variable (using the mapping above) is set to 1.
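The mapping from ‘highway’ tag values to the twelve indicator variables can be sketched in plain Python. As a simplifying assumption, this sketch starts from precomputed distances between a building and its nearby roads instead of performing the buffer and spatial join step; a building intersects an r-metre buffer around a road exactly when its distance to that road is at most r. The names `ROAD_CATEGORIES`, `road_category`, and `road_indicators` are hypothetical.

```python
# Meta-categories and their 'highway' tag values, as listed in the text.
ROAD_CATEGORIES = {
    "residential": {"residential", "living_street"},
    "highway": {"primary", "secondary", "tertiary"},
    "motorway": {"motorway", "trunk"},
    "service": {"service"},
}
RANGES_M = (30, 60, 90)


def road_category(highway_tag):
    """Return the meta-category of a 'highway' tag value, or None."""
    for name, values in ROAD_CATEGORIES.items():
        if highway_tag in values:
            return name
    return None


def road_indicators(nearby_roads):
    """Build the twelve indicators for one building.

    nearby_roads: iterable of (highway_tag, distance_in_metres) pairs.
    Roads outside the four meta-categories are ignored.
    """
    nearest = {}
    for tag, distance in nearby_roads:
        category = road_category(tag)
        if category is not None:
            nearest[category] = min(distance, nearest.get(category, float("inf")))

    indicators = []
    for category in ROAD_CATEGORIES:  # fixed order: residential .. service
        distance = nearest.get(category, float("inf"))
        indicators.extend(int(distance <= r) for r in RANGES_M)
    return indicators


# A building 12 m from a residential road and 75 m from a service road.
example = road_indicators([("residential", 12.0), ("service", 75.0), ("footway", 5.0)])
# example == [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

Because the ranges are nested, an indicator of 1 at 30 m always implies 1 at 60 m and 90 m for the same meta-category.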
Proximity to parking lots
We hypothesize that distance from parking lots of various sizes can be used to predict building type. For example, we would expect that as parking lot size increases, the likelihood that the building is a non-residential building would also increase. We extract the parking lot geometries from the OSM data using the ‘amenity’ tag having a value of either ‘parking’ or ‘parking_space’. We first examine the distribution of parking lot size across the study region and create three classes of parking lots based on the natural breaks of the parking lot size distribution using the Fisher-Jenks algorithm22.
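To illustrate what the natural breaks classification optimizes, the sketch below finds the three-class partition of parking lot areas that minimizes the within-class sum of squared deviations. It is a brute-force search over the same objective, not the dynamic-programming Fisher-Jenks algorithm itself, and it is only practical for small inputs. The function names and the sample areas are hypothetical.

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the mean of a list of numbers."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def natural_breaks_3(areas):
    """Return (small_max, medium_max): upper bounds of the first two classes.

    Tries every way to cut the sorted areas into three non-empty groups
    and keeps the cut with the lowest total within-class variation.
    """
    data = sorted(areas)
    n = len(data)
    if n < 3:
        raise ValueError("need at least three values for three classes")
    best = None
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            cost = sum_sq_dev(data[:i]) + sum_sq_dev(data[i:j]) + sum_sq_dev(data[j:])
            if best is None or cost < best[0]:
                best = (cost, data[i - 1], data[j - 1])
    return best[1], best[2]


def parking_class(area, breaks):
    """Assign 'small', 'medium', or 'large' given the two break values."""
    small_max, medium_max = breaks
    if area <= small_max:
        return "small"
    if area <= medium_max:
        return "medium"
    return "large"


# Made-up parking lot areas in square metres.
sample_areas = [50, 60, 70, 400, 450, 500, 3000, 3200]
breaks = natural_breaks_3(sample_areas)
```

With the break values in hand, each parking lot receives one of the three size classes before buffers are drawn around it.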
Next, we enrich each building in OSM with parking lot indicator variables that indicate whether or not each building falls in a 30 m, 60 m, or 90 m range of three parking lot categories: (1) small, (2) medium, and (3) large, yielding a total of nine additional indicator variables. To compute the parking lot indicator variables for each building, we create corresponding buffers around the parking lots. We then perform a spatial join between these buffers and the polygons of the buildings in OSM.
Building footprint size
The size of a building footprint can be a key predictor of building type23. Therefore, in addition to the road network and parking lot buffers, we compute the area of each building footprint geometry and use it as another (ratio-scaled) feature for our decision tree model.
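For a simple polygon, the footprint area follows from the shoelace formula. The sketch below is a plain-Python illustration that ignores interior holes and assumes the vertex coordinates are already expressed in a projected coordinate system in metres; in practice a geospatial library would normally provide this value. The function name `footprint_area` is hypothetical.

```python
def footprint_area(ring):
    """Area of a simple polygon given as a list of (x, y) vertices.

    Uses the shoelace formula; the result does not depend on whether the
    vertices are listed clockwise or counter-clockwise.
    """
    n = len(ring)
    twice_area = 0.0
    for k in range(n):
        x1, y1 = ring[k]
        x2, y2 = ring[(k + 1) % n]  # wrap around to close the ring
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# A 10 m x 20 m rectangle and an L-shaped footprint (made-up coordinates).
rectangle = [(0, 0), (10, 0), (10, 20), (0, 20)]
l_shape = [(0, 0), (10, 0), (10, 5), (5, 5), (5, 10), (0, 10)]
```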
Intersection with land use
OSM data includes the geometries and descriptive attributes of the land use areas on which buildings are located. These data may explicitly describe the use of the land that a building stands on, thus providing insight into the use of the building itself24. Therefore, we extracted polygons having the ‘landuse’ tag in the OSM data and spatially joined them with the building footprints, resulting in another feature for our machine learning model.
OSM building tags
In addition to geometry, each building has a set of associated tags, which describe features using pairs of unique keys and corresponding values. Besides the derived features above, we use the tags from the OSM data that we deemed relevant for accurately categorizing the buildings. The tags are: ‘building’, ‘name’, ‘source’, ‘addr:street’, ‘building:levels’, ‘shop’, ‘website’, ‘brand’, and ‘amenity’. With the exception of the ‘building’ tag, each tag is treated as a binary indicator variable, where a building has a value of 1 if it has the tag and 0 if it does not. For the ‘building’ tag, we use the tag value rather than the presence or absence of the tag and encode each value as a nominal indicator variable. Since the number of possible building tag values is theoretically unlimited, we select the most common values, namely ‘apartments’, ‘church’, ‘civic’, ‘commercial’, ‘construction’, ‘detached’, ‘dormitory’, ‘garage’, ‘garages’, ‘greenhouse’, ‘hospital’, ‘hotel’, ‘house’, ‘industrial’, ‘kindergarten’, ‘office’, ‘parking’, ‘public’, ‘residential’, ‘retail’, ‘roof’, ‘school’, ‘semidetached_house’, ‘service’, ‘shed’, ‘static_caravan’, ‘terrace’, ‘warehouse’, and ‘yes’. We create a separate nominal variable called ‘miscellaneous’ that covers all remaining building values across the three study areas.
We manually selected these tags based on their relevance for distinguishing building types, while making sure that the model can be transferred to other geographic areas. For example, a building with a website address seems more likely to be ‘non-residential’. It is worth noting, however, that our model can handle any tags available in the OSM raw data; the hand-picked tags serve as a proof of concept for our proposal.
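The tag encoding can be sketched as a function that turns the tag dictionary of one building into indicator variables. The feature names (`has_…`, `building=…`) and the function name `encode_tags` are hypothetical, and treating a building without a ‘building’ value as all zeros is an assumption of this sketch.

```python
# Tags encoded by presence only.
PRESENCE_TAGS = (
    "name", "source", "addr:street", "building:levels",
    "shop", "website", "brand", "amenity",
)

# The most common 'building' tag values, as listed in the text.
COMMON_BUILDING_VALUES = (
    "apartments", "church", "civic", "commercial", "construction",
    "detached", "dormitory", "garage", "garages", "greenhouse",
    "hospital", "hotel", "house", "industrial", "kindergarten",
    "office", "parking", "public", "residential", "retail", "roof",
    "school", "semidetached_house", "service", "shed",
    "static_caravan", "terrace", "warehouse", "yes",
)


def encode_tags(tags):
    """Encode one building's OSM tags as a dict of 0/1 indicators."""
    features = {f"has_{key}": int(key in tags) for key in PRESENCE_TAGS}

    value = tags.get("building")
    for common in COMMON_BUILDING_VALUES:
        features[f"building={common}"] = int(value == common)

    # Any other building value falls into the 'miscellaneous' indicator.
    is_misc = value is not None and value not in COMMON_BUILDING_VALUES
    features["building=miscellaneous"] = int(is_misc)
    return features


# Made-up examples: a shop tagged building=yes and a rarely used value.
shop = encode_tags({"building": "yes", "name": "Corner Bakery", "shop": "bakery"})
rare = encode_tags({"building": "boathouse"})
```

Each building therefore contributes 8 presence indicators plus 30 indicators for the ‘building’ value, of which at most one is set.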
Decision tree classification
Using the features described in the previous sections, we use a classic C4.5 binary decision tree classifier25 that recursively selects the attributes yielding the highest information gain to construct the decision tree. To train the decision tree, we use the authoritative ground truth building types obtained from the respective counties and city. We chose a decision tree for classification because of its interpretability, which allows us to understand where and why classification errors are made and guides our search for discriminatory features that separate the residential and non-residential classes. To parameterize our decision tree, we use the Gini index26, which is commonly used as a measure of class impurity. We prune the decision tree when no additional decision criterion decreases the impurity of a node by more than 0.01%.
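The impurity criterion can be made concrete with a small sketch that computes the Gini impurity of a node and the weighted impurity decrease achieved by a candidate split. Reading the 0.01% threshold as an absolute decrease of 0.0001 in Gini impurity is an assumption of this sketch, as are the function names and the toy labels.

```python
from collections import Counter

MIN_DECREASE = 0.0001  # assumption: 0.01% read as an absolute Gini decrease


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


def worth_splitting(parent, left, right, min_decrease=MIN_DECREASE):
    """True if the split reduces impurity by more than the threshold."""
    return impurity_decrease(parent, left, right) > min_decrease


# Toy node with 6 residential and 4 non-residential buildings.
parent = ["res"] * 6 + ["non"] * 4
perfect = (["res"] * 6, ["non"] * 4)                       # separates the classes
useless = (["res"] * 3 + ["non"] * 2, ["res"] * 3 + ["non"] * 2)  # same mix as parent
```

The perfect split removes all impurity and is kept, whereas the useless split leaves the class mix unchanged and falls below the threshold.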