gev_utils package

Submodules

gev_utils.DataImportUtils module

class gev_utils.DataImportUtils.DataMunger(data_dict)

Bases: object

Carries out data munging of self.data_dict. Drops unneeded data columns, removes header units, strips characters from numeric fields so that correct dtypes can be assigned, extracts station coordinates from station metadata and creates a datetime index. Also contains a method to return an extreme value dict.

Attributes:
self.data_dict (dict): A dictionary where keys are station names and values are lists of station data and metadata.

self.station_coords_list (list): A list of latitude and longitude extracted from station metadata.

assign_correct_dtypes()

Iterates through self.data_dict. Rows containing missing observations are dropped, and the following dtypes are assigned: Year (Int64), Month (Int64), Rainfall (float32).

Returns:

None
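
The dtype assignment described above can be sketched with plain pandas. This is a minimal illustration, not the package's implementation: the function name, the example values and the step that coerces non-numeric placeholders such as '---' to NaN are assumptions.

```python
import pandas as pd


def assign_dtypes_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing observations, then cast to the documented dtypes."""
    # Assumption: non-numeric placeholders are coerced to NaN so they can be dropped.
    cleaned = df.apply(pd.to_numeric, errors="coerce").dropna()
    return cleaned.astype({"Year": "Int64", "Month": "Int64", "Rainfall": "float32"})


raw = pd.DataFrame({
    "Year": ["1990", "1990", "1991"],
    "Month": ["1", "2", "1"],
    "Rainfall": ["55.2", "---", "71.0"],
})
typed = assign_dtypes_sketch(raw)
print(typed.dtypes)
```

The row holding the placeholder is dropped before casting, so the remaining columns can take the nullable Int64 and float32 dtypes without errors.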

create_dt_index()

Iterates through self.data_dict. The Year and Month columns are used to create a datetime index, which is assigned to a new column, ‘Date’, and set as the index of the DataFrame. The Year and Month columns are then dropped.

Returns:

None
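
A possible way to build such an index for a single DataFrame is shown below. It is a sketch only: the function name is hypothetical, and fixing the day of the month to 1 is an assumption, since the documentation does not say which day is used.

```python
import pandas as pd


def create_dt_index_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Build a 'Date' index from Year and Month, then drop those columns."""
    # pandas assembles datetimes from columns named year, month and day.
    parts = pd.DataFrame({"year": df["Year"], "month": df["Month"], "day": 1})
    out = df.assign(Date=pd.to_datetime(parts)).set_index("Date")
    return out.drop(columns=["Year", "Month"])


monthly = pd.DataFrame({"Year": [1990, 1990], "Month": [1, 2], "Rainfall": [55.2, 40.1]})
indexed = create_dt_index_sketch(monthly)
print(indexed)
```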

drop_unneeded_cols(targ_col=['Rainfall'])

Drops all columns apart from ‘Year’, ‘Month’ and the columns listed in targ_col.

Kwargs:

targ_col (list): A list containing names (str) of columns to keep.

Returns:

Modified data_dict consisting of columns in targ_col and ‘Year’ and ‘Month’ columns.

Return type:

self.data_dict (dict)

get_extreme_dict(data_dict)

Iterates through a data_dict and resamples by year, returning the maxima. It returns a dictionary containing the resampled DataFrames.

Parameters:
  • data_dict (dict) – A dictionary where the keys are station names and values are DataFrames. DataFrames must have a datetime index.

Returns:

maxima_dict (dict): A dictionary where the keys are station names and the values are DataFrames containing the annual maxima.
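
The annual-maxima step can be illustrated as follows. This sketch groups by the calendar year of the index instead of calling resample, which avoids differences in frequency aliases between pandas versions; as a result the output index holds integer years rather than timestamps. The function name and station name are hypothetical.

```python
import pandas as pd


def get_extreme_dict_sketch(data_dict: dict) -> dict:
    """Return a dict of DataFrames holding the maximum value per calendar year."""
    # Group each station's rows by the year of the datetime index and keep the maximum.
    return {name: df.groupby(df.index.year).max() for name, df in data_dict.items()}


idx = pd.to_datetime(["1990-01-01", "1990-07-01", "1991-03-01"])
stations = {"example_station": pd.DataFrame({"Rainfall": [55.0, 120.5, 80.0]}, index=idx)}
maxima = get_extreme_dict_sketch(stations)
print(maxima["example_station"])
```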

get_station_coords()

Iterates through self.data_dict. Latitude and longitude are extracted from the metadata and appended to self.station_coords_list.

Returns:

None

remove_units_header(targ_str='mm', targ_col='Rainfall')

Removes the first row if it has been used to specify units of measure.

Kwargs:

targ_str (str): The unit string to be removed.

Returns:

None
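
One way to express this check for a single DataFrame is sketched below. The function name is hypothetical, and the exact comparison (an exact match after stripping whitespace) and the index reset are assumptions about behaviour the documentation does not specify.

```python
import pandas as pd


def remove_units_header_sketch(df, targ_str="mm", targ_col="Rainfall"):
    """Drop the first row only when its targ_col cell holds the unit string."""
    if str(df[targ_col].iloc[0]).strip() == targ_str:
        return df.iloc[1:].reset_index(drop=True)
    return df


with_units = pd.DataFrame({"Year": ["", "1990"], "Rainfall": ["mm", "55.2"]})
no_units = remove_units_header_sketch(with_units)
print(no_units)
```

Because the row is only dropped when the unit string is present, calling the function a second time leaves the data unchanged.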

save_data(outputs_dir='outputs', **kwargs)

Iterates through self.data_dict and saves each DataFrame into outputs_dir as a .csv named by the dictionary key (station name).

Parameters:

outputs_dir (str) – The name of the directory where .csvs should be saved, default ‘outputs’.

Kwargs:

Keyword arguments to pass to pandas.DataFrame.to_csv().
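
The saving loop might look like the sketch below, which writes into a temporary directory so it can be run safely. The function name, the returned list of paths and the creation of the directory when it is missing are assumptions, not documented behaviour.

```python
import os
import tempfile

import pandas as pd


def save_data_sketch(data_dict, outputs_dir="outputs", **kwargs):
    """Write each DataFrame to <outputs_dir>/<station name>.csv."""
    os.makedirs(outputs_dir, exist_ok=True)  # assumption: create the directory if needed
    paths = []
    for name, df in data_dict.items():
        path = os.path.join(outputs_dir, f"{name}.csv")
        df.to_csv(path, **kwargs)  # kwargs are forwarded to DataFrame.to_csv
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    example = {"example_station": pd.DataFrame({"Rainfall": [55.2]})}
    written = save_data_sketch(example, outputs_dir=tmp, index=False)
    saved_names = [os.path.basename(p) for p in written]
    reloaded = pd.read_csv(written[0])
```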

strip_asterices()

Iterates through self.data_dict and removes asterisks from the DataFrames. Each DataFrame in self.data_dict is modified in place.

Returns:

None
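
Assuming the intent is to remove literal '*' flags from string cells so that the values can later be cast to numeric dtypes, the operation on one DataFrame could be sketched like this. The function name is hypothetical, and unlike the documented method this version returns a new DataFrame rather than modifying in place.

```python
import pandas as pd


def strip_asterisks_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Remove literal '*' characters from every string cell."""
    def clean(value):
        # Non-string cells are passed through untouched.
        return value.replace("*", "") if isinstance(value, str) else value

    return df.apply(lambda col: col.map(clean))


flagged = pd.DataFrame({"Year": ["1990", "1991"], "Rainfall": ["55.2*", "71.0"]})
stripped = strip_asterisks_sketch(flagged)
print(stripped)
```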

tidy_data()

Carries out data munging by calling self.drop_unneeded_cols, self.remove_units_header, self.strip_asterices, self.assign_correct_dtypes, self.create_dt_index, self.get_station_coords.

Returns:

None

class gev_utils.DataImportUtils.GeoMunger(gdf_house_price, gdf_station_coords, aoi_outline=None)

Bases: object

Handles parsing, unification and tidying of spatial data and supporting information (house price data and modelling results).

self.house_price

A pandas.DataFrame of average house prices.

Type:

pandas.DataFrame

self.station_coords

A pandas.DataFrame of station latitude and longitude.

Type:

pandas.DataFrame

self.metrics_df

A pandas.DataFrame of predictions, error bounds and supporting data.

Type:

pandas.DataFrame

self.unified_gdf

A geopandas.GeoDataFrame containing statistical data and geospatial data.

Type:

geopandas.GeoDataFrame

self.aoi_outline

A geopandas.GeoDataFrame of the outline of the area of interest (aoi).

Type:

geopandas.GeoDataFrame

add_metrics_to_stations()

Merges self.station_coords and self.metrics_df on station name with the result set to self.metrics_df.

Returns:

None
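
The merge can be pictured with two small DataFrames. The column name 'Station', the station names and all values below are hypothetical, and plain pandas DataFrames stand in for the objects held by GeoMunger.

```python
import pandas as pd

# Hypothetical station coordinates and model metrics keyed by station name.
station_coords = pd.DataFrame({
    "Station": ["station_a", "station_b"],
    "Latitude": [51.0, 55.0],
    "Longitude": [-1.0, -3.0],
})
metrics_df = pd.DataFrame({
    "Station": ["station_a", "station_b"],
    "Return intensity": [150.0, 180.0],
    "Lower": [130.0, 160.0],
    "Upper": [175.0, 210.0],
})

# Merge on the shared station-name column; the result would replace metrics_df.
merged = station_coords.merge(metrics_df, on="Station")
print(merged)
```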

convert_crs(crs='EPSG:27700')

Converts the crs of self.house_price and self.station_coords to crs.

Kwargs:
crs (str): A valid crs string to be passed to geopandas.GeoDataFrame.to_crs(), default ‘EPSG:27700’ (recommended for the UK).

Returns:

None

extract_station_metrics(gev_dict)

Extracts mean predicted return intensity and the 95% HDI from an instance of gev_dict and the corresponding return periods.

Parameters:

gev_dict (dict) – An instance of gev_dict generated by fitting extreme events to the generalized extreme value distribution using PyMC. NB gev_dict is generated by the accompanying Jupyter Notebook.

Returns:

A pandas DataFrame containing only the information needed for carrying out analysis in the accompanying Jupyter notebook.

Return type:

station_metrics (pd.DataFrame)

get_error_bounds(df, risk_var='Return intensity', targ_col='Risk-adjusted house price', hdi_cols=['Lower', 'Upper'])

This method takes a df containing the 95% HDI and uses the ratio of the lower and upper bounds to the mean rainfall prediction to calculate error bounds for risk-adjusted house prices. The same ratio between the mean rainfall prediction and the 95% HDI is applied to the risk-adjusted house prices.

Parameters:

df (pandas.DataFrame) – A pandas.DataFrame containing the risk variable, its 95% HDI bounds and the risk-adjusted house prices.

Kwargs:
risk_var (str): The name of the column containing the risk variable in df.

targ_col (str): The name of the column containing the risk-adjusted values in df.

hdi_cols (list): A list of length 2 containing the names of the HDI columns in df. The order must be lower bound followed by upper bound.
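
One reading of the ratio described above is that each bound is scaled by bound / mean prediction, so the relative width of the interval around the adjusted price matches that of the rainfall HDI. The sketch below follows that interpretation, which is an assumption. The function name is hypothetical, and the output column names ‘Lower_adj’ and ‘Upper_adj’ are borrowed from the defaults of Plotter.get_difference.

```python
import pandas as pd


def get_error_bounds_sketch(df, risk_var="Return intensity",
                            targ_col="Risk-adjusted house price",
                            hdi_cols=("Lower", "Upper")):
    """Carry the relative HDI width of risk_var over to targ_col."""
    lower, upper = hdi_cols
    out = df.copy()
    # Assumed formula: adjusted bound = adjusted price * (HDI bound / mean prediction).
    out["Lower_adj"] = out[targ_col] * out[lower] / out[risk_var]
    out["Upper_adj"] = out[targ_col] * out[upper] / out[risk_var]
    return out


example = pd.DataFrame({
    "Return intensity": [100.0],
    "Lower": [80.0],
    "Upper": [125.0],
    "Risk-adjusted house price": [200000.0],
})
bounded = get_error_bounds_sketch(example)
print(bounded[["Lower_adj", "Upper_adj"]])
```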

get_metrics_df(gev_dict)

Updates self.metrics_df with the mean prediction and the lower and upper bounds of the HDI (by default 95%) by extracting this information from gev_dict.

Parameters:

gev_dict (dict) – A dict containing modelling results from the accompanying Jupyter Notebook.

Returns:

None

get_return_period_results(station_name, df, return_period_yrs=20)

This method takes a pandas DataFrame containing the predicted return periods, return intensities and HDI bounds, and returns a single-row DataFrame containing these metrics for the requested return period, with an added column holding the station name.

Parameters:
  • station_name (str) – The name of the met station.

  • df (pandas.DataFrame) – A pandas.DataFrame containing the results of the posterior predictive, comprising the return period, return intensity and HDI (95% is used in this example).

Kwargs:
return_period_yrs (int): The return period of interest, default 20.

Returns:

A pandas.DataFrame containing the return intensity, lower and upper intervals for return_period_yrs.

Return type:

df (pandas.DataFrame)
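
The row selection can be sketched as below. The function name, the column names ‘Return period’ and ‘Station’, the period_col argument and all example values are hypothetical; the documentation does not state how the columns are labelled.

```python
import pandas as pd


def get_return_period_results_sketch(station_name, df, return_period_yrs=20,
                                     period_col="Return period"):
    """Return the single row for return_period_yrs, tagged with the station name."""
    # period_col is a hypothetical name for the column holding return periods.
    row = df.loc[df[period_col] == return_period_yrs].copy()
    row.insert(0, "Station", station_name)
    return row.reset_index(drop=True)


predictions = pd.DataFrame({
    "Return period": [10, 20, 50],
    "Return intensity": [140.0, 165.0, 190.0],
    "Lower": [125.0, 145.0, 160.0],
    "Upper": [160.0, 190.0, 230.0],
})
result = get_return_period_results_sketch("example_station", predictions)
print(result)
```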

get_risk_adj_df(df=None, risk_var='Return intensity', adj_col='Average_Price', hdi_cols=['Lower', 'Upper'], risk_weight=0.5, risk_threshold=160)

Returns risk-adjusted house prices with lower and upper estimates. The lower and upper estimates are based on the ratio of the HDI lower/upper estimates for risk_var. risk_weight determines the largest value of the risk factor; risk_threshold sets the level (in mm of rainfall) which must be exceeded for risk-adjustment to take place.

Kwargs:
df (pandas.DataFrame): A pandas.DataFrame containing the risk variable predictions, HDI bounds and house prices.

risk_var (str): The name of the risk variable in df, default value ‘Return intensity’.

adj_col (str): The name of the column containing values to be risk-adjusted in df, default ‘Average_Price’.

risk_weight (float): The weighting that risk_var should carry; larger values give a greater weighting. This value is passed to self.minmax_normalizer(). Default is 0.5.

risk_threshold (int): The value that should be exceeded for risk-adjustment to occur, default 160 (mm rainfall).

Returns:

A pandas.DataFrame containing risk-adjusted values and HDI bounds appended to df.

Return type:

df (pandas.DataFrame)

get_risk_adjusted_hp(df, adjustment_df, threshold=160)

This method gets risk-adjusted house prices for stations with rainfall above threshold. adjustment_df is a pandas.DataFrame of normalized rainfall.

Parameters:
  • df (pandas.DataFrame) – A pandas.DataFrame containing return intensity and house prices.

  • adjustment_df (pandas.DataFrame) – A pandas.DataFrame generated by self.minmax_normalizer.

Kwargs:
threshold (int): The value that ‘Return intensity’ should exceed for risk-adjustment to take place, default value 160 (mm).

Returns:

A pandas.DataFrame which is df with additional columns containing the risk-adjusted values, in addition to a column containing the values used to perform risk-adjustment.

Return type:

df (pandas.DataFrame)

join_house_prices(cols_to_keep=None)

Joins self.house_price and self.station_coords using geopandas.sjoin_nearest() and sets the result to self.unified_gdf. As a result, each met station is assigned to its geographically nearest house price region. Unneeded columns are also dropped; a list of column names can be passed as cols_to_keep to change which columns are kept.

Kwargs:
cols_to_keep (list-like): A list of column names to be kept in the resulting GeoDataFrame.

Returns:

None

minmax_normalizer(df, risk_var='Return intensity', hdi_cols=['Lower', 'Upper'], upper_bound=0.5)

Min-Max normalizer which normalizes values to have a max value of ‘upper_bound’. Returns pandas DataFrame of the original values and the normalized values.

Parameters:

df (pandas.DataFrame) – A pandas.DataFrame containing the data to be normalized.

Kwargs:
risk_var (str): The name of the column to be normalized, default ‘Return intensity’.

hdi_cols (list): A list of length 2, where each element is a string of the name of the columns containing the lower and upper HDI, respectively. Default value is [‘Lower’, ‘Upper’].

upper_bound (float): Specifies the largest normalized value, effectively controlling the scale of the normalizer.

Returns:

A pandas.DataFrame containing normalized values.

Return type:

adj_multiplier (pandas.DataFrame)
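
The scaling idea can be sketched with standard min-max normalization stretched to upper_bound. This is an assumption about the formula: the documentation only states that the maximum normalized value equals upper_bound, so mapping the minimum to 0 is a guess. The function name and the ‘Normalized’ column are hypothetical, and handling of hdi_cols is omitted.

```python
import pandas as pd


def minmax_normalizer_sketch(df, risk_var="Return intensity", upper_bound=0.5):
    """Scale risk_var so its minimum maps to 0 and its maximum to upper_bound."""
    values = df[risk_var]
    # Standard min-max scaling to [0, 1], then stretched to [0, upper_bound].
    scaled = (values - values.min()) / (values.max() - values.min()) * upper_bound
    return pd.DataFrame({risk_var: values, "Normalized": scaled})


intensities = pd.DataFrame({"Return intensity": [100.0, 150.0, 200.0]})
adj_multiplier = minmax_normalizer_sketch(intensities)
print(adj_multiplier)
```

A column in which every value is identical would give a zero denominator, so a real implementation would need to guard against that case.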

remove_nans()

Removes rows from self.unified_gdf in which the column ‘Average_Price’ is NA and resets the index of self.unified_gdf. NB calling this method will change self.unified_gdf in place.

Returns:

None
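
The same operation on an ordinary DataFrame looks like the sketch below. A plain pandas.DataFrame stands in for self.unified_gdf, the ‘Station’ column and the values are hypothetical, and the sketch returns a new object instead of modifying in place.

```python
import pandas as pd


def remove_nans_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with a missing 'Average_Price' and renumber the index."""
    return df.dropna(subset=["Average_Price"]).reset_index(drop=True)


unified = pd.DataFrame({
    "Station": ["station_a", "station_b", "station_c"],
    "Average_Price": [250000.0, None, 310000.0],
})
tidy = remove_nans_sketch(unified)
print(tidy)
```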

tidy_data(gev_dict)

Performs data munging on a gev_dict instance.

Returns:

None

class gev_utils.DataImportUtils.WebScraper(base_url)

Bases: object

Scrapes met station data and parses data into a usable format.

self.base_url

The base URL to be scraped.

Type:

str

self.station_links

A dictionary where keys are station names and values are links to station data.

Type:

dict

self.station_dict

A dictionary where keys are station names and values are a list of data and metadata.

Type:

dict

self.nonstandard_list

A list of stations where the format of the metadata does not permit automatic parsing.

Type:

list

get_coordinates()

Attempts to extract coordinates from station metadata. If this fails, adds the station name to self.nonstandard_list.

Returns:

None

get_data()

Scrapes met station data (measurements) and metadata (coordinates, whether a station was relocated) from self.base_url. Modifies self.station_dict values to contain a list of the station’s link, a pandas.DataFrame of station measurements and a list of metadata.

Returns:

A dict where keys are station names and values are lists of station data and metadata.

Return type:

self.station_dict (dict)

Scrapes base_url for met station names and met station links. Updates self.station_links and creates a dictionary where keys are met station names and values are met station links.

Returns:

None

gev_utils.PlottingUtils module

class gev_utils.PlottingUtils.Plotter(gb_outline_gdf, gev_dict, data_dict)

Bases: object

Creates visualizations. In general, this class takes data and instances of matplotlib.figure.Figure and matplotlib.axes.Axes, creates a visualization and returns the modified fig and ax(es) instances for further fine-tuning.

self.gb_outline_gdf

An outline of the island of Great Britain.

Type:

geopandas.GeoDataFrame

self.gev_dict

The results of statistical modelling created by the accompanying Jupyter notebook.

Type:

dict

self.data_dict

The instance of data_dict created by DataImportUtils, containing data to be used in visualizations.

Type:

dict

get_difference(df, targ_col='Risk-adjusted house price', hdi_cols=['Lower_adj', 'Upper_adj'])

Helper method to get the absolute difference between values and HDI bounds, because matplotlib.axes.Axes.errorbar needs the absolute differences rather than the bound values themselves.

Parameters:

df (pandas.DataFrame) – A pandas.DataFrame of the risk-adjusted values generated by DataImportUtils.

Kwargs:
targ_col (str): A string of the name of the column containing risk-adjusted values, default ‘Risk-adjusted house price’.

hdi_cols (list): A list of length 2 containing the names of the columns with the risk-adjusted lower and upper bounds, respectively. Default [‘Lower_adj’, ‘Upper_adj’].

Returns:

A pandas.DataFrame containing data used to add error bars to plots.

Return type:

error_bar_df (pandas.DataFrame)
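
The conversion from bounds to error-bar lengths can be sketched as follows. The function name, the output column names ‘lower_err’ and ‘upper_err’ and the example values are hypothetical.

```python
import pandas as pd


def get_difference_sketch(df, targ_col="Risk-adjusted house price",
                          hdi_cols=("Lower_adj", "Upper_adj")):
    """Return the distances from each value to its lower and upper bound."""
    lower, upper = hdi_cols
    return pd.DataFrame({
        "lower_err": (df[targ_col] - df[lower]).abs(),
        "upper_err": (df[upper] - df[targ_col]).abs(),
    })


adjusted = pd.DataFrame({
    "Risk-adjusted house price": [200000.0, 150000.0],
    "Lower_adj": [160000.0, 140000.0],
    "Upper_adj": [250000.0, 165000.0],
})
error_bar_df = get_difference_sketch(adjusted)
# Shape (2, N): first row lower errors, second row upper errors.
err = error_bar_df.T.to_numpy()
print(err)
```

An array of shape (2, N) laid out this way is the form that matplotlib.axes.Axes.errorbar accepts for asymmetric xerr or yerr values.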

plot_choropleth(geomunger, fig, axes)

Plots 3 choropleth maps in one figure: the first is of the return intensity, the second is of the risk variable (average house price) and the third is of the risk-adjusted house price.

Parameters:
  • geomunger (class.GeoMunger) – An instance of GeoMunger.

  • fig (matplotlib.figure.Figure) – An instance of matplotlib.figure.Figure.

  • axes (an array of matplotlib.axes.Axes) – The array of matplotlib.axes.Axes.

Returns:

fig (matplotlib.figure.Figure): The modified instance of matplotlib.figure.Figure.

axes (matplotlib.axes.Axes or array of Axes): The modified array of matplotlib.axes.Axes instances.

plot_posterior(idata, ax)

Plots posterior predictive check.

Parameters:
  • idata (arviz.InferenceData) – An instance of arviz.InferenceData.

  • ax (matplotlib.axes.Axes) – An instance of matplotlib.axes.Axes.

Returns:

The modified instance of matplotlib.axes.Axes.

Return type:

ax (matplotlib.axes.Axes)

plot_prior_check(idata, ax, xlim=[0, 180], ylim=[0, 0.4])

Plots samples from the priors.

Parameters:
  • idata (arviz.InferenceData) – An instance of arviz.InferenceData.

  • ax (matplotlib.axes.Axes) – An instance of matplotlib.axes.Axes.

Kwargs:
xlim (list): A list to be passed to matplotlib.pyplot.xlim, default value [0, 180].

ylim (list): A list to be passed to matplotlib.pyplot.ylim, default value [0, 0.4].

Returns:

The modified instance of matplotlib.axes.Axes.

Return type:

ax (matplotlib.axes.Axes)

plot_prior_post_prediction(station_name, fig, axes)

Plots the prior predictive, the posterior predictive and predictions. Styles plots using climatex.mplstyle (hard-coded).

Parameters:
  • station_name (str) – A string of the station name key in gev_dict.

  • fig (matplotlib.figure.Figure) – An instance of matplotlib.figure.Figure.

  • axes (an array of matplotlib.axes.Axes) – The array of matplotlib.axes.Axes.

Returns:

fig (matplotlib.figure.Figure): The modified instance of matplotlib.figure.Figure.

axes (matplotlib.axes.Axes or array of Axes): The modified array of matplotlib.axes.Axes instances.

plot_return_period(post_pred, return_periods, ax)

Plots a graph of the return intensity for each return period with 95% HDI.

Parameters:
  • post_pred (arviz.InferenceData.posterior) – An instance of arviz.InferenceData.posterior.

  • return_periods (numpy.array) – The numpy array used to generate predictions for the posterior.

  • ax (matplotlib.axes.Axes) – An instance of matplotlib.axes.Axes.

Returns:

The modified instance of matplotlib.axes.Axes.

Return type:

ax (matplotlib.axes.Axes)

plot_risk_adj_bar(df, fig, ax, index=None)

Plots risk-adjusted house prices and error bars. df is the result of GeoMunger.get_risk_adj_df().

Parameters:
  • df (pandas.DataFrame) – A pandas.DataFrame containing risk-adjusted house prices and statistical predictions.

  • fig (matplotlib.figure.Figure) – An instance of matplotlib.figure.Figure.

  • ax (matplotlib.axes.Axes) – An instance of matplotlib.axes.Axes.

Kwargs:
index (pandas.DataFrame.index): An instance of pandas.DataFrame.index to be used for the y-axis (NB this is a horizontal bar plot, so these are station names).

Returns:

fig (matplotlib.figure.Figure): The modified instance of matplotlib.figure.Figure.

ax (matplotlib.axes.Axes): The modified matplotlib.axes.Axes instance.

df.index (pandas.DataFrame.index): The pandas.DataFrame.index used as labels for the plot; these are station names.

plot_ts(station_name, fig, axes)

Plots the time series with annual maxima highlighted and the probability density function estimated using Gaussian kernels.

Parameters:
  • station_name (str) – A string of the met station name.

  • fig (matplotlib.figure.Figure) – An instance of matplotlib.figure.Figure.

  • axes (matplotlib.axes.Axes or array of Axes) – An instance or array of matplotlib.axes.Axes.

Returns:

fig (matplotlib.figure.Figure): The modified instance of matplotlib.figure.Figure.

axes (matplotlib.axes.Axes or array of Axes): The modified array of matplotlib.axes.Axes instances.

Module contents