Underestimation of N buildings / SQM in France

I believe that the number of buildings and the total floorspace of buildings in France are underestimated in the French exposure files. This could lead to a risk assessment that shows lower risk than is actually the case. In the image, for example, Belgium shows a higher risk than France, and the same is true for Germany:

The data

Occupancy (main usage)	ESRM20 N buildings	French cadastre N buildings	Factor (cadastre / ESRM20)	ESRM20 footprint (sqm)	French cadastre footprint (sqm)	Factor (cadastre / ESRM20)
Res	14M	19.9M	1.42	1.5B	2.49B	1.66
Com	300K	1.4M	4.6	492M	567M	1.15
Ind	596K	477K	0.8	289M	513M	1.78
Total	15.3M	21.8M	1.42	2.28B	3.57B	1.57
Total with other types (unknown / agricultural / religious)	15.3M	50.6M (22.7M are unknown)	3.3	2.28B	5.6B (1.6B is unknown)	2.46

There is also something strange in the ratio of building count to footprint for the commercial and industrial taxonomies. ESRM20 has far fewer commercial buildings than the cadastre (300K vs 1.4M) but almost as much commercial footprint (492M vs 567M sqm). For industrial buildings the opposite holds: ESRM20 has more buildings than the cadastre (596K vs 477K) but much less footprint (289M vs 513M sqm). Could it be that the average area per building in ESRM20 is too large for the commercial types and too small for the industrial types?
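The question above can be checked roughly by dividing total footprint by building count for each source. The snippet below is only a sketch using the rounded values from the table; the dictionary and function names are made up for illustration.

```python
# Rounded values copied from the table: (number of buildings, footprint in sqm).
ESRM20 = {"Com": (300e3, 492e6), "Ind": (596e3, 289e6)}
CADASTRE = {"Com": (1.4e6, 567e6), "Ind": (477e3, 513e6)}


def mean_footprint(n_buildings, footprint_sqm):
    """Average footprint per building in sqm."""
    return footprint_sqm / n_buildings


for occupancy in ("Com", "Ind"):
    esrm = mean_footprint(*ESRM20[occupancy])
    cad = mean_footprint(*CADASTRE[occupancy])
    print(f"{occupancy}: ESRM20 {esrm:.0f} sqm, cadastre {cad:.0f} sqm per building")
```

With the rounded table values this gives roughly 1640 sqm (ESRM20) against 405 sqm (cadastre) per commercial building, and roughly 485 sqm against 1075 sqm per industrial building.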

Code used to get footprint size in France from ESRM20

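For the footprint in ESRM20, I take the total floorspace per building and divide it by the number of floors derived from the height attribute of the taxonomy. For the residential file, the floorspace comes from the dwelling area and dwellings-per-building files and is divided by their FLOORS column.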
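As a self-contained illustration of that rule (without taxonomy-lib), the height string can be turned into a floor count as sketched below. The function names are hypothetical, and the choices for ranges (upper bound for closed ranges, lower bound plus one for open ranges) are the assumptions used in this issue, not an official ESRM20 convention.

```python
def floors_from_height(height):
    """Map a height attribute such as 'H:2', 'HBET:3-5' or 'HBET:6-' to a floor count."""
    key, value = height.split(":")
    if key == "H":
        # Exact number of floors
        return int(value)
    lower, _, upper = value.partition("-")
    if upper == "":
        # Open-ended range: assume one floor more than the lower bound
        return int(lower) + 1
    # Closed range: assume the upper bound
    return int(upper)


def footprint_per_building(floorspace_sqm, height):
    """Approximate footprint as total floorspace divided by the floor count."""
    return floorspace_sqm / floors_from_height(height)


for h in ("H:2", "HBET:3-5", "HBET:6-"):
    print(h, floors_from_height(h))
```

Taking the upper bound of a range gives the smallest footprint for a given floorspace, so this choice tends to lower the ESRM20 footprint estimate.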

Worth noting: I think both files, area_per_dwelling_France_RES and dwlngs_per_bldngs_France_RES, contain a typo. Features 9 and 10 are MCF/LWAL+CDM/H:1 and MCF/LWAL+CDM/H:2, but CDM should be CDN. At least that is how it appears in the exposure files themselves.
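A mismatch like this can be spotted by comparing the taxonomy strings of the exposure file with those of the lookup files. The following is a minimal sketch using plain lists; the helper names are hypothetical, and the second exposure string is an invented example of a taxonomy carrying an LFC tag.

```python
import re

# Matches a lateral force coefficient tag such as '+LFC:10.0'
LFC_TAG = re.compile(r"\+LFC:[^/+]*")


def strip_lfc(taxonomy):
    """Remove the +LFC tag, which the lookup files do not carry."""
    return LFC_TAG.sub("", taxonomy)


def unmatched_taxonomies(exposure_taxonomies, lookup_taxonomies):
    """Return exposure taxonomies (without LFC tag) that are missing from the lookup."""
    lookup = set(lookup_taxonomies)
    return sorted({strip_lfc(t) for t in exposure_taxonomies} - lookup)


exposure = ["MCF/LWAL+CDN/H:1", "CR/LFINF+CDL+LFC:10.0/H:2"]  # second entry is invented
lookup = ["MCF/LWAL+CDM/H:1", "CR/LFINF+CDL/H:2"]
print(unmatched_taxonomies(exposure, lookup))
```

Here only the CDN entry is reported, because the lookup spells it CDM.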

The code needs taxonomy-lib to run. It can be installed with pip install https://git.gfz-potsdam.de/globaldynamicexposure/libraries/taxonomy-lib/-/archive/main/taxonomy-lib-main.zip.

import pandas as pd
from taxonomylib import Taxonomy


def get_footprint(df):
    # Create a `height` attribute, that is returned from the `Taxonomy` class. The attribute can
    # look like: H:2, HBET:3-5, HBET:6-
    df["height"] = df.apply(lambda item: Taxonomy(item["TAXONOMY"]).get_section('height'),
                            axis=1)
    df["floors"] = None
    for idx, item in df.iterrows():
        # Split the string at the colon to get the key and the value
        k, v = item["height"].split(':')

        # If the key is `H`, the value is only one integer
        if k == "H":
            floors = int(v)

        # If the value ends with a `-`, the range is without a limit, therefore we take the
        # minimum amount of floors + 1
        elif v[-1] == "-":
            floors = int(v[:-1]) + 1

        # Else we take the highest value in the range. For example for HBET:3-5, the number of
        # floors will be 5.
        else:
            floors = int(v.split('-')[1])
        # Use .loc to avoid chained assignment, which may silently write to a copy
        df.loc[idx, "floors"] = floors

    # The footprint is the total SQM, divided by the amount of floors and multiplied by the
    # amount of buildings.
    df["footprint"] = (df["AREA_PER_BUILDING_SQM"] / df["floors"]) * df["BUILDINGS"]
    df["total_area"] = df["AREA_PER_BUILDING_SQM"] * df["BUILDINGS"]
    return df


def get_residential_footprint(df_res, df_dwellings, df_area):
    # Merge df_dwellings and df_area based on TAXONOMY
    c_df = pd.merge(df_area, df_dwellings, on='TAXONOMY')

    # Set average footprint per taxonomy and make a dictionary
    sqm_dct = {}
    n_dwellings_dct = {}
    c_df["average_footprint"] = \
        (c_df["AREA_DWELLING_URBAN"] * c_df["DWELLINGS PER BUILDING"]) / c_df["FLOORS"]
    columns = ["TAXONOMY", "average_footprint", "DWELLINGS PER BUILDING"]
    for idx, (taxonomy, footprint, n_dwellings) in c_df[columns].iterrows():
        sqm_dct[taxonomy] = footprint
        n_dwellings_dct[taxonomy] = n_dwellings

    # Create a list of taxonomies of the residential area and match with the values in the
    # df_dwellings/df_area datasets.
    list_of_taxonomies = df_res["TAXONOMY"].unique()

    # The CSV files of dwelling area do not match with the taxonomies in the residential file.
    # There is the +LFC tag that is disregarded. Therefore, this one needs to be added to the
    # dictionary too. If the `LFC` tag is found, it is removed and matched with the taxonomies
    # that do exist in the CSV files of dwelling area.
    for t in list_of_taxonomies:
        lfc = t.find('+LFC:')
        if lfc != -1:
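            # Search for the next slash from the tag onwards; if the tag ends the string,
            # nothing follows it
            lfc_slash = t.find('/', lfc)
            _t = t[:lfc] + (t[lfc_slash:] if lfc_slash != -1 else '')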
            sqm_dct[t] = sqm_dct[_t]
            n_dwellings_dct[t] = n_dwellings_dct[_t]

    # The footprint is the footprint SQM according to the taxonomy dictionary multiplied by the
    # amount of buildings
    df_res["footprint"] = df_res.apply(
        lambda item: sqm_dct[item["TAXONOMY"]] * item["BUILDINGS"],
        axis=1)
    df_res["n_dwellings"] = df_res.apply(
        lambda item: n_dwellings_dct[item["TAXONOMY"]] * item["BUILDINGS"],
        axis=1)
    print(f"""
        ------------------------------------------
        Number of dwellings: {df_res["n_dwellings"].sum():.2f}
        ------------------------------------------
        """)
    return df_res


if __name__ == "__main__":
    # Open all files
    df_dwellings = pd.read_csv('dwlngs_per_bldngs_France_RES.csv', sep=',')
    df_area = pd.read_csv('area_per_dwelling_France_RES.csv', sep=',')
    df_res = pd.read_csv('Exposure_Model_France_Res.csv', sep=',')
    df_com = pd.read_csv('Exposure_Model_France_Com.csv', sep=',')
    df_ind = pd.read_csv('Exposure_Model_France_Ind.csv', sep=',')
    # print(df_res["OCCUPANTS_PER_ASSET_NIGHT"].sum() + df_com["OCCUPANTS_PER_ASSET_NIGHT"].sum() +
    #       df_ind["OCCUPANTS_PER_ASSET_NIGHT"].sum())

    # Get footprints for each exposure file
    df_com = get_footprint(df_com)
    df_ind = get_footprint(df_ind)
    df_res = get_residential_footprint(df_res, df_dwellings, df_area)

    # Sum all footprints

    sum_footprint_res = df_res["footprint"].sum()
    sum_footprint_com = df_com["footprint"].sum()
    sum_footprint_ind = df_ind["footprint"].sum()
    sum_footprint = sum_footprint_res + sum_footprint_com + sum_footprint_ind

    sum_buildings_res = df_res["BUILDINGS"].sum()
    sum_buildings_com = df_com["BUILDINGS"].sum()
    sum_buildings_ind = df_ind["BUILDINGS"].sum()
    sum_buildings = sum_buildings_res + sum_buildings_com + sum_buildings_ind

    print(f"""
        ------------------------------------------
        Footprint
        ------------------------------------------
        Residential: {sum_footprint_res:.2f}
        Commercial:  {sum_footprint_com:.2f}
        Industrial:  {sum_footprint_ind:.2f}
        Sum:         {sum_footprint:.2f}
        ------------------------------------------
        N Buildings
        ------------------------------------------
        Residential: {sum_buildings_res:.2f}
        Commercial:  {sum_buildings_com:.2f}
        Industrial:  {sum_buildings_ind:.2f}
        Sum:         {sum_buildings:.2f}
        ------------------------------------------
        """)

Query of France Cadastre dataset

Source: data.gouv.fr batiments (last updated 24 March 2023) / alternative dataset with all cadastre information

SQL Query for total number of buildings and footprint size:

SELECT usage_1, count(*), SUM(ST_Area(geometrie, True)) AS area
FROM batiment
GROUP BY usage_1
ORDER BY usage_1 DESC;

        usage_1         |  count   |        area        
------------------------+----------+--------------------
 Sportif                |    53062 |  42963005.55154355
 Résidentiel            | 19921308 | 2487924496.4517083
 Religieux              |    83440 | 23129203.045996834
 Industriel             |   477078 | 513365477.32243454
 Indifférencié          | 22707428 | 1619562691.4066281
 Commercial et services |  1399220 |  567139570.0830743
 Annexe                 |  4878844 | 271368712.85938895
 Agricole               |  1083990 |  577505751.2126371
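
The cadastre totals used in the table can be reproduced from this query output. The snippet below is a small sketch that hard-codes the rows shown above (areas rounded to two decimals); the variable names are made up for illustration.

```python
# usage_1: (count, area in sqm), copied from the query output
CADASTRE = {
    "Sportif": (53062, 42963005.55),
    "Résidentiel": (19921308, 2487924496.45),
    "Religieux": (83440, 23129203.05),
    "Industriel": (477078, 513365477.32),
    "Indifférencié": (22707428, 1619562691.41),
    "Commercial et services": (1399220, 567139570.08),
    "Annexe": (4878844, 271368712.86),
    "Agricole": (1083990, 577505751.21),
}

# Classes that correspond to the three ESRM20 exposure files
MAIN = ("Résidentiel", "Commercial et services", "Industriel")

main_count = sum(CADASTRE[k][0] for k in MAIN)
main_area = sum(CADASTRE[k][1] for k in MAIN)
total_count = sum(count for count, _ in CADASTRE.values())
unknown_share = CADASTRE["Indifférencié"][0] / total_count

print(f"Res + Com + Ind: {main_count / 1e6:.1f}M buildings, {main_area / 1e9:.2f}B sqm")
print(f"All classes: {total_count / 1e6:.1f}M buildings")
print(f"Share of Indifférencié buildings: {unknown_share:.0%}")
```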

The Indifférencié buildings are scattered across the country. They include some parts of train stations, but also large buildings that are clearly residential, commercial or industrial and have simply not been classified as such. See for example this map of a small part of Paris, with Résidentiel in red, Commercial et services in blue, Industriel in yellow and Indifférencié in black:

/cc @hcrowley