There were a number of steps in the classification of ESB Network (ESBN) connections data.
The CSO engaged in a number of meetings and workshops with ESBN to fully understand the data available and the limitations of this data in estimating housing activity. Following this engagement ESBN provided the CSO with a data extract of relevant connection information since 2011.
The next step was to add an Eircode to the connections data. This was completed by an administrative data linking exercise where a combination of unique reference numbers, address strings and location information were used to link the data to other data sources, including BER, eStamping and Geodirectory.
Adding the Eircode then allowed us to link a sample of the connections dataset to the Census of Population datasets for 2011 and 2016. By confirming the status of dwellings in this sample, the Census information allowed us to investigate the characteristics of new and other dwellings in the connections dataset between the two Census waves.
Recursive partitioning, a statistical method for multivariate analysis, was then used to create baseline rules for classifying the data not in the Census sample. Further linking with BER, eStamping and Geodirectory was then used to validate and refine these rules.
The principal data source for the New Dwelling Completions (NDC) series is connections data provided to the CSO by ESBN. This information is collected on ESBN form NC2 (Single Domestic Dwelling or Farm Dwelling application) for single dwellings and form NC1 (Multi-Unit Development) for scheme dwellings and apartments.
To produce detailed analysis of the existing connections series and to allow for a new series to be developed, ESBN provided the CSO with an extended historical data series. This data series included additional variables not previously made available to the Department of Housing, Planning and Local Government (DHPLG) to produce their ESB connections time series.
The ESBN data indicates whether the dwelling is an apartment, single house or scheme dwelling and also whether it is urban or rural.
Important dates in the connection process include:
The file also contains comments fields and other indicators relating to non-dwelling connections or reconnections after a period of more than two years. These are either from the original NC2 form or are entered by the ESBN engineers during the processing of the application.
Detailed address information is also provided where available. This includes Irish Grid coordinates for the geographical location of the site from an Ordnance Survey mapping perspective and latitude/longitude coordinates of the electricity substation.
Finally, the Meter Point Reference Number (MPRN) is also provided, which is essential for linking the data to other sources such as the BER dataset.
This information is compiled by ESBN and made available to the CSO on a quarterly basis.
A second key data source for the NDC is the Building Energy Rating (BER) dataset. Under Statutory Instruments (S.I.) No. 666 of 2006 and No. 243 of 2012, a BER certificate must be secured by the person who commissions the construction of a new dwelling before the dwelling is occupied for the first time (with some very minor exceptions). As part of the BER assessment process, detailed information on the physical characteristics of the dwelling is collected, including the type of dwelling (detached house, semi-detached house, etc.). The address of the dwelling is also captured. These data are compiled by the Sustainable Energy Authority of Ireland (SEAI) and made available to the CSO on a quarterly basis. The BER dataset also includes the date of construction of a dwelling and whether the BER certificate is for a new house (Final or Provisional) or for an existing dwelling (Existing).
A third data source used in the compilation of the NDC is the Geodirectory. This is a dataset of all buildings in the State, created and maintained by the Geodirectory using data from An Post and Ordnance Survey Ireland (OSi). The Geodirectory contains the postal address, including Eircode, of every building in the State, along with other geographical information such as X/Y coordinates. The Geodirectory also has dates of creation and information on use or status for all buildings.
A fourth data source is the eStamping dataset of stamp duty returns. These data are compiled by Revenue and made available to the CSO on a monthly basis. They contain address information as well as details on whether a dwelling is being sold as new or second hand.
The files from Census 2011 and 2016 were used to clarify the status of a dwelling on Census night and this information was then used in the historical analysis of the ESB connections data. For example, a dwelling could change status from vacant to occupied between 2011 and 2016 or a dwelling could exist in 2016 but not in 2011.
The ESB connections dataset covers new dwellings but also includes non-dwellings, connections after more than two years of disconnection and dwellings that are part of unfinished developments (so-called “ghost” estates).
A connections dataset from ESBN for the period 2011 to 2016 was matched with data from Census 2011 to determine adjustments required to remove those connections which do not refer to new dwellings so that a series on New Dwelling Completions could be produced.
The reference year of 2011 was chosen for this matching exercise as this allowed the CSO to use Census 2011 and 2016 data to determine the level of new dwellings completed in the intercensal period.
The instructions to Census enumerators are that a dwelling is included in the Census enumeration if it is habitable, which is defined as a dwelling which has a roof, walls, hall door and windows installed.
A dwelling is defined as “under construction” on Census night if any of the roof, walls, hall door or windows are not installed. If a dwelling is included in the Census enumeration and is not classified as “under construction” then the dwelling is assumed to be complete on Census night.
Census enumerators in 2011 used the Geodirectory as of Q4 2010 for the list of properties to be enumerated. Any unlisted habitable dwellings were added to the Census; however, buildings that were under construction and not in the Geodirectory list were not enumerated.
The linking of the ESB connection of a dwelling to its Census record in 2011 or 2016 can show the status of a dwelling on Census night and clarify the reason for an ESB connection. For example, if a dwelling exists in 2016 but not in 2011 then this can be used to determine if the dwelling was completed over the period covered by the connections dataset.
A sample of connections energised between the two Census waves is then used to estimate the level of new dwellings over this period.
It was not possible to use the Eircode as the unique identifier to directly match the ESB connections data to the Census data, as currently the connections dataset does not have an Eircode associated with each connection.
Therefore, after cleaning the address data, a number of alternative matching methods were used to allocate an Eircode to the connections data and allow linking from ESB to Census. The following steps were taken to assign an Eircode to an ESB connection:
• ESB data can be linked to the Building Energy Rating (BER) dataset using the unique identifier of the MPRN. This matches 9.1% of the entire ESB connections dataset (2011 - Q1 2018).
• The CSO allocated Eircodes to eStamping data as part of the Residential Property Price Index (RPPI) project. Thus the eStamping data can be linked to the connections dataset using the MPRN and the BER reference number. 27% of all Eircodes were matched using either BER or eStamping.
• A further matching process was implemented by matching an address string that lacks an Eircode to all the address strings in a master dataset. The master dataset typically contains all the addresses in the country and includes geo IDs such as Eircode (where available). For this project the master dataset used was the An Post Geodirectory. The fuzzy matching process used is based on string edit distance, which is a measure of the difference between any two address strings. For this to work accurately, all addresses, including those in the master dataset, must be standardised so that differences caused by case, abbreviations, punctuation, non-standard spelling etc. do not affect the match score. Various other operations are carried out on the address string to reduce the variability that arises when people write down their addresses. Once the preparation stage is complete the matching process starts; in this project the Levenshtein distance was used. Address matches with a sufficiently high score are chosen and the Eircode is then assigned from the master dataset. 13% of Eircodes were matched using this method.
• A number of ESB connections have an Irish Grid reference supplied. This can be used to match connections to the Geodirectory and an Eircode added. This is used primarily to match single rural dwellings and as the reference is based on an OSi map it is only used when there is one dwelling within 25m of the reference provided.
• Finally, fuzzy matching was used to match ESB connections to the Geodirectory, using address fields and the coordinates of the ESB substations. The remaining 60% of all Eircodes were matched to the Geodirectory using the Irish Grid reference or substation coordinates.
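The address-string step in the list above can be illustrated with a minimal sketch. It is not the CSO's production code: the abbreviation table, the 0.9 score threshold, the function names and the Eircodes in the usage example are all assumptions made for illustration, and a real standardisation stage would be far more extensive.

```python
import re

# Hypothetical abbreviation table; a real standardisation step would be much larger.
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue", "co": "county"}


def standardise(address: str) -> str:
    """Lower-case, replace punctuation with spaces and expand a few abbreviations."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a 0-1 score, where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def assign_eircode(address: str, master: dict, threshold: float = 0.9):
    """Return the Eircode of the best-scoring master address, or None if below threshold."""
    target = standardise(address)
    best_score, best_code = 0.0, None
    for master_address, eircode in master.items():
        score = similarity(target, standardise(master_address))
        if score > best_score:
            best_score, best_code = score, eircode
    return best_code if best_score >= threshold else None


# Usage with made-up addresses and placeholder Eircodes.
master = {
    "12 Main Street, County Cork": "X00 AAA1",
    "14 Main Street, County Cork": "X00 AAA2",
}
print(assign_eircode("12 Main St., Co. Cork", master))
```

Standardising both sides before scoring is what lets "St." and "Street" count as the same token. The threshold then governs the trade-off between unmatched records and false matches.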
The matching process assigned Eircodes to 43% of ESB connections overall. This rate is held down by connections in 2017 and 2018, which are less likely to have had an Eircode allocated. The rate increases to 61% for connections energised between Census 2011 and 2016. This sample can be used to match with Census of Population data.
The standard success rate when allocating Eircodes to an existing dataset is 60% because of various issues with Irish postal addresses, including the proliferation of non-unique addresses. The match rate of 61% with the Census files is therefore in line with industry standards.
The matching of ESB connections with the Census data was made more difficult as the ESB connections are authorised early on in the construction process and the address used can be different to the final “Postal Address” used for matching with the Geodirectory.
The ESB connections data with a matched Eircode can be broken down as:
ESB connections with an Eircode that were energised between the two Census waves were then matched to Census 2011 and 2016.
As we do not expect any of the matching methods used to be 100% accurate, a final sense check on the quality of the match was then undertaken and uncertain matches removed as they could have an impact on the classification rules.
These included:
• Records which were not matched to Census 2011 but had a Geodirectory create date earlier than 2010 and so would have been in the dwelling list of the Census enumerators for Census 2011.
• Records matched to Census 2011 which have a Geodirectory create date later than 2012 and so would not have existed to be enumerated in Census 2011.
Finally, all connections energised in 2011 and in 2016 were removed from the sample dataset to ensure that the classification rules are not influenced by the timing of connection of buildings under construction or completed around the time of the 2011 or 2016 Census.
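The sense checks and the removal of 2011 and 2016 connections can be sketched as a simple record filter. The field names (`matched_census_2011`, `geo_create_year`, `energised_year`) are hypothetical and the records are invented; only the year cut-offs come from the text above.

```python
def is_uncertain_match(record: dict) -> bool:
    """Flag matches that contradict the Geodirectory create date."""
    year = record.get("geo_create_year")
    if year is None:
        return False  # no create date available, so nothing to contradict
    if not record["matched_census_2011"] and year < 2010:
        return True   # should have been on the 2011 enumerator list
    if record["matched_census_2011"] and year > 2012:
        return True   # should not have existed at Census 2011
    return False


def keep_for_sample(record: dict) -> bool:
    """Keep records energised strictly between the Census years with a plausible match."""
    if record["energised_year"] in (2011, 2016):
        return False
    return not is_uncertain_match(record)


# Invented records for illustration only.
records = [
    {"matched_census_2011": False, "geo_create_year": 2008, "energised_year": 2013},
    {"matched_census_2011": True, "geo_create_year": 2014, "energised_year": 2014},
    {"matched_census_2011": False, "geo_create_year": 2013, "energised_year": 2016},
    {"matched_census_2011": False, "geo_create_year": 2013, "energised_year": 2014},
]
sample = [r for r in records if keep_for_sample(r)]
print(len(sample))
```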
This left a final sample of 22,820 connections energised between the two Census waves with a valid Eircode (47%) for use in deriving rules for classification.
The Census historical analysis enabled us to check the quality of the ESB flags for reconnections and non-dwellings. We would expect the majority of the records flagged as reconnections to be for older dwellings, and over 82% of reconnections that were matched to Census 2016 had a year built recorded in the Census before 2011. We would also not expect to find the records flagged as non-dwellings in the Census data, and 95% of non-dwellings could not be matched to either Census file.
The understanding of the ESB connections database gained from the historical analysis also allows the derivation of methods to classify connections as new dwellings or as dwellings in unfinished housing developments (UFHD) where there is no Census match.
From the data matching above there are 22,820 records in the sample to be used to derive classification rules. This sample represents 40,104 ESB connections over the period 2012 to 2015.
Based on the Census matching in the historical analysis, these records are classified as follows:
• 13,994 records which were not found in Census 2011, or were found in Census 2011 with a status of "Under Construction", are classified as New Dwellings.
• 5,492 records which were found in Census 2011 but were not flagged as reconnections are classified as UFHD.
• 3,334 records flagged as reconnections by ESBN are classified as such.
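The three-way labelling of the sample can be written as a small function. This is a sketch under one stated assumption: the ESBN reconnection flag is checked first, so a flagged record is a reconnection whatever its Census status. The function name and the status strings other than "Under Construction" are hypothetical.

```python
from typing import Optional


def label_from_census(flagged_reconnection: bool, census_2011_status: Optional[str]) -> str:
    """Label a sample record; census_2011_status is None when no 2011 match exists."""
    if flagged_reconnection:
        return "Reconnection"  # assumption: the ESBN flag takes precedence
    if census_2011_status is None or census_2011_status == "Under Construction":
        return "New Dwelling"
    return "UFHD"


# Counts quoted in the text; they should sum to the sample size of 22,820.
SAMPLE_COUNTS = {"New Dwelling": 13_994, "UFHD": 5_492, "Reconnection": 3_334}
print(sum(SAMPLE_COUNTS.values()))  # 22820
```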
Recursive partitioning methods are nonparametric statistical techniques for assessing the importance of variables and for prediction and classification. They produce decision trees which display the succession of rules that need to be followed to derive a predicted value or class. Two recursive partitioning methods, CART modelling via rpart in R and Random Forest, were used to derive basic classification rules to categorise dwellings in the ESB connections dataset.
Both these methods indicated that the days between the authorised and energised dates in the ESB connections dataset was the most significant factor in determining the status of the connection, followed by type of dwelling.
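To show what a single partitioning step involves, the sketch below searches for the threshold on one numeric variable that gives the lowest weighted Gini impurity, a common CART splitting criterion. It is a plain-Python illustration on synthetic data, not the rpart or Random Forest models used in the analysis. A full tree would repeat the search on each resulting subset and across all variables.

```python
def gini(labels: list) -> float:
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values: list, labels: list) -> tuple:
    """Return (threshold, weighted_gini) for the best rule `value < threshold`."""
    best = (None, float("inf"))
    n = len(values)
    # Skip the smallest value: splitting there would leave the left side empty.
    for threshold in sorted(set(values))[1:]:
        left = [lab for v, lab in zip(values, labels) if v < threshold]
        right = [lab for v, lab in zip(values, labels) if v >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Synthetic days-to-energise values and labels, invented for illustration.
days = [100, 200, 400, 1500, 1800, 2000]
status = ["new", "new", "new", "ufhd", "ufhd", "ufhd"]
print(best_split(days, status))  # (1500, 0.0)
```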
A training and test process was used to generate and validate the results. After removing the reconnections, three-quarters of the sample (14,614 records) was used as a training dataset to generate the following rules:
• If the number of days between the authorised and energised dates is < 1,320, then the connection is classified as a new dwelling
• If the number of days between the authorised and energised dates is >= 1,320 then
ο Where the dwelling type is a scheme or apartment, the connection is classified as a UFHD
ο Where the dwelling type is a single house, the connection is classified as a new dwelling
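The rules above translate directly into a short function. The function name, the argument names and the dwelling-type strings are assumptions for illustration; the 1,320-day threshold and the branch logic are taken from the rules as stated.

```python
from datetime import date

THRESHOLD_DAYS = 1320  # threshold derived from the training dataset


def classify_connection(authorised: date, energised: date, dwelling_type: str) -> str:
    """Apply the baseline rules to one connection that is not a reconnection."""
    days = (energised - authorised).days
    if days < THRESHOLD_DAYS:
        return "New Dwelling"
    if dwelling_type in ("scheme", "apartment"):
        return "UFHD"
    return "New Dwelling"  # single houses are treated as new whatever the delay


# Usage with invented dates.
print(classify_connection(date(2013, 1, 10), date(2014, 3, 1), "scheme"))
print(classify_connection(date(2008, 1, 10), date(2014, 3, 1), "apartment"))
print(classify_connection(date(2008, 1, 10), date(2014, 3, 1), "single"))
```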
The remaining 25% of the sample (4,872 records) was then used as a test dataset to validate the accuracy of these rules. Table 7.1 below shows that the rules above were correct in classifying new and unfinished dwellings to an 83% accuracy level for the test dataset.
The results of both the CART and Random Forest recursive partitioning methods agreed in over 99% of records when compared using the test sample.
From Table 7.1 it can be seen that for new dwellings there were 519 false positives and 332 false negatives in the predicted classification of New Dwelling Completions.
Table 7.1: Confusion matrix
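The quoted accuracy can be reproduced from the figures given in the text. The sketch assumes that the 519 false positives and 332 false negatives are the only misclassified records among the 4,872 test records, as is the case for a two-class confusion matrix.

```python
test_records = 4872
false_positives = 519   # classified as new, actually UFHD
false_negatives = 332   # classified as UFHD, actually new

correct = test_records - false_positives - false_negatives
accuracy = correct / test_records

print(correct)             # 4021
print(f"{accuracy:.1%}")   # 82.5%, which rounds to the 83% quoted
```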
A dwelling can be a false positive, incorrectly classified as “new”, if it is part of an unfinished housing development but was not authorised at the time it was completed, so that the number of days between the authorised and energised dates is less than 1,320.
A dwelling can be a false negative, incorrectly classified as not being “new”, because the number of days between the authorised and energised dates is 1,320 or more, even though the dwelling was not actually completed close to the period in which it was authorised. Often these cases are the result of a phase of an unfinished development being authorised but not started, or not completed to a significant level, before the development stalled and then restarted in a much later period.
Taking the false positives and false negatives into account, there is likely to be a slight bias towards overcounting new dwellings in these rules. However, these errors can be corrected by linking to other data sources such as BER and Geodirectory where additional information on the construction period is available.
Records are classified as reconnections and non-dwellings if flagged as such by ESBN.
New dwellings and UFHD dwellings are classified based on the rules generated by the recursive partitioning described above. The CSO then adjusts for false positives, false negatives and other classification errors.
• Classified as new but dwelling is a UFHD. These are corrected by validating with the date of construction from the BER system when it is a “final” or “provisional” BER certificate.
• Classified as UFHD but dwelling is new. These are corrected by validating with the date of construction from the BER system when it is a “final” or “provisional” BER certificate.
Further adjustments are made using BER data, including cases where a dwelling is a reconnection but is not flagged as such by ESBN. These are corrected by validating with the date of construction from the BER system when it is an “Existing” BER certificate. Similarly, connections that are classified as a non-dwelling but have a domestic BER certificate are reclassified as new.
If no BER entry exists then false negatives and false positives cannot be corrected unless an Eircode can be allocated to the ESB connection, in which case the creation date in the Geodirectory can also be used to validate and correct. As Eircode usage increases in the different stages of the housing lifecycle, there will be increased accuracy in the classification of ESB connections.
The New Dwelling Completions data series will be revised over time as more information on the dwellings in the ESB connections dataset becomes available.