gev_utils package
Submodules
gev_utils.DataImportUtils module
- class gev_utils.DataImportUtils.DataMunger(data_dict)
Bases:
object
Carries out data munging of self.data_dict. Drops unneeded data columns, removes header units, strips characters from numeric fields so that correct dtypes can be assigned, extracts station coordinates from station metadata, creates a datetime index, and provides a method that returns an extreme value dict.
- Attributes:
- self.data_dict (dict): A dictionary where keys are station
names and values are lists of station data and metadata.
- self.station_coords_list (list): A list of latitude and
longitude extracted from station metadata.
- assign_correct_dtypes()
Iterates through self.data_dict. Rows containing missing observations are dropped, and the following dtypes are assigned: Year (Int64), Month (Int64), Rainfall (float32).
- Returns:
None
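A minimal sketch of the dtype assignment described above, applied to a single hypothetical station table rather than to self.data_dict. The sample values are invented for illustration, and the use of pandas.to_numeric before casting is an assumption, not necessarily what the method does.

```python
import pandas as pd

# Hypothetical raw station table: everything is still text after scraping.
raw = pd.DataFrame(
    {
        "Year": ["1990", "1990", "1991"],
        "Month": ["1", "2", "1"],
        "Rainfall": ["55.1", None, "72.4"],
    }
)

# Drop rows with missing observations, then assign the documented dtypes.
clean = raw.dropna().copy()
clean["Year"] = pd.to_numeric(clean["Year"]).astype("Int64")
clean["Month"] = pd.to_numeric(clean["Month"]).astype("Int64")
clean["Rainfall"] = pd.to_numeric(clean["Rainfall"]).astype("float32")
```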
- create_dt_index()
Iterates through self.data_dict. The Year and Month columns are used to create a datetime index, which is assigned to a new column, ‘Date’, and set as the index of the DataFrame. The Year and Month columns are then dropped.
- Returns:
None
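The index construction can be sketched as follows for one hypothetical DataFrame. Fixing the day to the first of the month is an assumption made here so that pandas.to_datetime can assemble a full date; the method itself may build the index differently.

```python
import pandas as pd

# Hypothetical monthly data with separate Year and Month columns.
df = pd.DataFrame(
    {
        "Year": [1990, 1990, 1991],
        "Month": [11, 12, 1],
        "Rainfall": [55.1, 80.3, 72.4],
    }
)

# pandas.to_datetime can assemble dates from year/month/day columns.
parts = df[["Year", "Month"]].rename(columns={"Year": "year", "Month": "month"})
df["Date"] = pd.to_datetime(parts.assign(day=1))

# Use the new column as the index and drop the columns it was built from.
df = df.set_index("Date").drop(columns=["Year", "Month"])
```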
- drop_unneeded_cols(targ_col=['Rainfall'])
Drops all columns apart from ‘Year’, ‘Month’ and the columns listed in targ_col.
- Kwargs:
targ_col (list): A list containing names (str) of columns to keep.
- Returns:
- Modified data_dict consisting of
columns in targ_col and ‘Year’ and ‘Month’ columns.
- Return type:
self.data_dict (dict)
- get_extreme_dict(data_dict)
Iterates through a data_dict and resamples by year, returning the maxima. It returns a dictionary containing the resampled DataFrames.
- Parameters:
data_dict (dict) – A dictionary where the keys are station names and values are DataFrames. DataFrames must have a datetime index.
- Returns:
- maxima_dict (dict): A dictionary where the keys are
station names and the values are DataFrames containing the annual maxima.
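A minimal sketch of resampling by year and taking the maxima, assuming each value in the dictionary is a DataFrame with a datetime index. The function name annual_maxima, the station name and the data are hypothetical; the "YS" (year start) frequency alias is a choice made for this sketch.

```python
import pandas as pd


def annual_maxima(data_dict):
    """Return a dict of station name -> DataFrame of annual maxima."""
    return {name: df.resample("YS").max() for name, df in data_dict.items()}


# Two years of hypothetical monthly rainfall for one station.
index = pd.date_range("1990-01-01", periods=24, freq="MS")
station = pd.DataFrame({"Rainfall": list(range(24))}, index=index)

maxima_dict = annual_maxima({"Hypothetical station": station})
```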
- get_station_coords()
Iterates through self.data_dict. Latitude and longitude are extracted from the metadata and appended to self.station_coords_list.
- Returns:
None
- remove_units_header(targ_str='mm', targ_col='Rainfall')
Removes the first row if it has been used to specify units of measure.
- Kwargs:
targ_str (str): The unit string to be removed.
- Returns:
None
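A sketch of the units-row check for a single DataFrame. The helper name drop_units_row is hypothetical, and comparing the first cell of targ_col to targ_str by exact match is an assumption; the method may test for the unit string differently.

```python
import pandas as pd


def drop_units_row(df, targ_str="mm", targ_col="Rainfall"):
    """Drop the first row only if it holds the unit string instead of data."""
    if str(df[targ_col].iloc[0]).strip() == targ_str:
        return df.iloc[1:].reset_index(drop=True)
    return df


with_units = pd.DataFrame({"Rainfall": ["mm", "55.1", "72.4"]})
without_units = pd.DataFrame({"Rainfall": ["55.1", "72.4"]})

trimmed = drop_units_row(with_units)
unchanged = drop_units_row(without_units)
```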
- save_data(outputs_dir='outputs', **kwargs)
Iterates through self.data_dict and saves each DataFrame into outputs_dir as a .csv named by the dictionary key (station name).
- Parameters:
outputs_dir (str) – The name of the directory where .csvs should be saved, default ‘outputs’.
- Kwargs:
Keyword arguments to pass to pandas.DataFrame.to_csv().
- strip_asterices()
Iterates through self.data_dict and strips asterisks from the DataFrames that contain them. Each DataFrame in self.data_dict is modified in place.
- Returns:
None
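A minimal sketch of stripping asterisks from a text column so that it can later be cast to a numeric dtype. The column name and values are hypothetical, and handling one column with a literal string replacement is an assumption about the approach.

```python
import pandas as pd

# Hypothetical readings where some values carry a trailing asterisk.
df = pd.DataFrame({"Rainfall": ["55.1*", "72.4", "80.3*"]})

# regex=False treats "*" as a literal character rather than a pattern.
df["Rainfall"] = df["Rainfall"].str.replace("*", "", regex=False)
```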
- tidy_data()
Carries out data munging by calling self.drop_unneeded_cols, self.remove_units_header, self.strip_asterices, self.assign_correct_dtypes, self.create_dt_index, self.get_station_coords.
- Returns:
None
- class gev_utils.DataImportUtils.GeoMunger(gdf_house_price, gdf_station_coords, aoi_outline=None)
Bases:
object
Handles parsing, unification and tidying of spatial data and supporting information (house price data and modelling results).
- self.house_price
A pandas.DataFrame of average house prices.
- Type:
pandas.DataFrame
- self.station_coords
A pandas.DataFrame of station latitude and longitude.
- Type:
pandas.DataFrame
- self.metrics_df
A pandas.DataFrame of predictions, error bounds and supporting data.
- Type:
pandas.DataFrame
- self.unified_gdf
A geopandas.GeoDataFrame containing statistical data and geospatial data.
- Type:
geopandas.GeoDataFrame
- self.aoi_outline
A geopandas.GeoDataFrame of the outline of the area of interest (aoi).
- Type:
geopandas.GeoDataFrame
- add_metrics_to_stations()
Merges self.station_coords and self.metrics_df on station name with the result set to self.metrics_df.
- Returns:
None
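The merge on station name might look like the sketch below. The key column name ‘Station’, the coordinates and the metric values are all hypothetical; only the idea of merging the two tables on station name comes from the description above.

```python
import pandas as pd

# Hypothetical inputs sharing a station-name column.
station_coords = pd.DataFrame(
    {"Station": ["A", "B"], "Latitude": [57.0, 51.5], "Longitude": [-3.4, -0.1]}
)
metrics_df = pd.DataFrame(
    {
        "Station": ["A", "B"],
        "Return intensity": [150.0, 170.0],
        "Lower": [130.0, 150.0],
        "Upper": [175.0, 195.0],
    }
)

# Inner merge keeps stations present in both tables.
merged = station_coords.merge(metrics_df, on="Station", how="inner")
```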
- convert_crs(crs='EPSG:27700')
Converts the crs of self.house_price and self.station_coords to crs.
- Kwargs:
- crs (str): A valid crs string to be passed to
geopandas.GeoDataFrame.to_crs(), default EPSG:27700, (recommended for the UK).
- Returns:
None
- extract_station_metrics(gev_dict)
Extracts the mean predicted return intensity, the 95% HDI and the corresponding return periods from an instance of gev_dict.
- Parameters:
gev_dict (dict) – An instance of gev_dict generated by fitting extreme events to the generalized extreme value distribution using PyMC. NB gev_dict is generated by the accompanying Jupyter Notebook.
- Returns:
- A pandas DataFrame
containing only the information needed for carrying out analysis in the accompanying Jupyter Notebook.
- Return type:
station_metrics (pd.DataFrame)
- get_error_bounds(df, risk_var='Return intensity', targ_col='Risk-adjusted house price', hdi_cols=['Lower', 'Upper'])
This method takes a df containing the 95% HDI and uses the ratio of the lower and upper bounds to calculate the error for risk-adjusted house prices. It applies the same ratio that holds between the mean rainfall prediction and the 95% HDI bounds to the risk-adjusted house prices.
- Parameters:
df (pandas.DataFrame) – A pandas.DataFrame containing the risk variable, its 95% HDI bounds and the risk-adjusted house prices.
- Kwargs:
- risk_var (str): The name of the column containing the risk
variable in df.
- targ_col (str): The name of the column containing the
risk-adjusted values in df.
- hdi_cols (list): A list of length 2 containing the names of
the HDI columns in df. The order must be lower bound followed by upper bound.
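One reading of the ratio-based error calculation, sketched under stated assumptions: each HDI bound is divided by the mean prediction and that ratio is multiplied by the risk-adjusted price. The helper name error_bounds and the output column names ‘Lower_adj’ and ‘Upper_adj’ are assumptions (the latter borrowed from the defaults of Plotter.get_difference); the actual method may differ.

```python
import pandas as pd


def error_bounds(
    df,
    risk_var="Return intensity",
    targ_col="Risk-adjusted house price",
    hdi_cols=("Lower", "Upper"),
):
    """Scale targ_col by the ratio of each HDI bound to the mean prediction."""
    lower, upper = hdi_cols
    out = df.copy()
    out["Lower_adj"] = out[targ_col] * out[lower] / out[risk_var]
    out["Upper_adj"] = out[targ_col] * out[upper] / out[risk_var]
    return out


# Hypothetical single-station example.
df = pd.DataFrame(
    {
        "Return intensity": [100.0],
        "Lower": [80.0],
        "Upper": [125.0],
        "Risk-adjusted house price": [200000.0],
    }
)
bounds = error_bounds(df)
```

Here the lower bound is 80% of the mean prediction and the upper bound is 125% of it, so the price bounds are 80% and 125% of the risk-adjusted price.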
- get_metrics_df(gev_dict)
Updates self.metrics_df with the mean prediction and the lower and upper bounds of the HDI (by default 95%) by extracting this information from gev_dict.
- Parameters:
gev_dict (dict) – A dict containing modelling results from the accompanying Jupyter Notebook.
- Returns:
None
- get_return_period_results(station_name, df, return_period_yrs=20)
This method takes a pandas DataFrame containing the predicted return periods, intensities and confidence intervals and returns a single-row DataFrame containing these metrics for the chosen return period, with an added column holding the site name.
- Parameters:
station_name (str) – The name of the met station.
df (pandas.DataFrame) – A pandas.DataFrame containing the results of the posterior predictive: the return period, return intensity and HDI (95% is used in this example).
- Kwargs:
- return_period_yrs (int): The return period of interest,
default 20.
- Returns:
- A pandas.DataFrame containing the
return intensity, lower and upper intervals for return_period_yrs.
- Return type:
df (pandas.DataFrame)
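A minimal sketch of selecting the row for one return period and tagging it with the station name. The helper name return_period_row and the column names ‘Return period’ and ‘Station’ are hypothetical, as are the values.

```python
import pandas as pd


def return_period_row(station_name, df, return_period_yrs=20):
    """Return the single row matching return_period_yrs, with a station column."""
    row = df.loc[df["Return period"] == return_period_yrs].copy()
    row["Station"] = station_name
    return row


# Hypothetical posterior predictive summary for one station.
df = pd.DataFrame(
    {
        "Return period": [10, 20, 50],
        "Return intensity": [120.0, 140.0, 165.0],
        "Lower": [105.0, 120.0, 138.0],
        "Upper": [138.0, 166.0, 201.0],
    }
)
result = return_period_row("Hypothetical station", df)
```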
- get_risk_adj_df(df=None, risk_var='Return intensity', adj_col='Average_Price', hdi_cols=['Lower', 'Upper'], risk_weight=0.5, risk_threshold=160)
Returns risk-adjusted house prices with lower and upper estimates. The lower and upper estimates are based on the ratio of the HDI lower/upper estimates for risk_var. risk_weight determines the largest value of the risk factor; risk_threshold sets the level (in mm of rainfall) that must be exceeded for risk-adjustment to take place.
- Kwargs:
- df (pandas.DataFrame): A pandas.DataFrame containing the
risk variable predictions, HDI bounds and house prices.
- risk_var (str): The name of the risk variable in df,
default value ‘Return intensity’.
- adj_col (str): The name of the column containing values to
be risk-adjusted in df, default ‘Average_Price’.
- risk_weight (float): The weighting that risk_var should
carry, larger values give a greater weighting. This value is passed to self.minmax_normalizer(). Default is 0.5.
- risk_threshold (int): The value that should be exceeded for
risk-adjustment to occur, default 160 (mm rainfall).
- Returns:
- A pandas.DataFrame containing
risk-adjusted values and HDI bounds appended to df.
- Return type:
df (pandas.DataFrame)
- get_risk_adjusted_hp(df, adjustment_df, threshold=160)
This method gets risk-adjusted house prices for stations with rainfall above threshold. adjustment_df is a pandas.DataFrame of normalized rainfall.
- Parameters:
df (pandas.DataFrame) – A pandas.DataFrame containing return intensity and house prices.
adjustment_df (pandas.DataFrame) – A pandas.DataFrame generated by self.minmax_normalizer.
- Kwargs:
- threshold (int): The value that ‘Return intensity’ should
exceed for risk-adjustment to take place, default value 160 (mm).
- Returns:
- A pandas.DataFrame which is df
with additional columns containing the risk-adjustment, in addition to a column containing the values used to perform the risk-adjustment.
- Return type:
df (pandas.DataFrame)
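The thresholding step can be sketched as below. The description above does not state how the normalized rainfall is applied to the price, so this sketch assumes, purely for illustration, that it acts as a fractional discount. The helper name risk_adjust, the ‘Adjustment’ column and the use of a Series in place of adjustment_df are also assumptions.

```python
import pandas as pd


def risk_adjust(df, multiplier, threshold=160):
    """Discount prices only where 'Return intensity' exceeds threshold."""
    out = df.copy()
    above = out["Return intensity"] > threshold
    # Stations at or below the threshold get no adjustment (assumption).
    out["Adjustment"] = multiplier.where(above, 0.0)
    out["Risk-adjusted house price"] = out["Average_Price"] * (1 - out["Adjustment"])
    return out


# Hypothetical data: only the second station exceeds the 160 mm threshold.
df = pd.DataFrame(
    {"Return intensity": [150.0, 200.0], "Average_Price": [200000.0, 300000.0]}
)
multiplier = pd.Series([0.25, 0.5])
adjusted = risk_adjust(df, multiplier)
```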
- join_house_prices(cols_to_keep=None)
Joins self.house_price and self.station_coords using geopandas.sjoin_nearest() and sets the result to self.unified_gdf. As a result, each met station is assigned to its geographically nearest house price region. Unneeded columns are also dropped; a list of column names can be passed as cols_to_keep to change which columns are kept.
- Kwargs:
- cols_to_keep (list-like): A list of column names to be kept
in the resulting GeoDataFrame.
- Returns:
None
- minmax_normalizer(df, risk_var='Return intensity', hdi_cols=['Lower', 'Upper'], upper_bound=0.5)
Min-Max normalizer which normalizes values to have a max value of ‘upper_bound’. Returns pandas DataFrame of the original values and the normalized values.
- Parameters:
df (pandas.DataFrame) – A pandas.DataFrame containing the data to be normalized.
- Kwargs:
- risk_var (str): The name of the column to be normalized,
default ‘Return intensity’.
- hdi_cols (list): A list of length 2, where each element is a
string of the name of the columns containing the lower and upper HDI, respectively. Default value is [‘Lower’, ‘Upper’].
- upper_bound (float): Specifies the largest value, effectively
controlling the scale of the normalizer.
- Returns:
- A pandas.DataFrame
containing normalized values.
- Return type:
adj_multiplier (pandas.DataFrame)
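A minimal sketch of min-max scaling to a chosen maximum, using the standard formula (x - min) / (max - min) * upper_bound. The helper name minmax_scale and the ‘Normalized’ column are hypothetical, and the HDI columns are left out because the description does not say how they are scaled.

```python
import pandas as pd


def minmax_scale(df, risk_var="Return intensity", upper_bound=0.5):
    """Return the original values alongside values scaled to [0, upper_bound]."""
    values = df[risk_var]
    scaled = (values - values.min()) / (values.max() - values.min()) * upper_bound
    return pd.DataFrame({risk_var: values, "Normalized": scaled})


df = pd.DataFrame({"Return intensity": [100.0, 150.0, 200.0]})
adj_multiplier = minmax_scale(df)
```

With upper_bound=0.5, the smallest return intensity maps to 0 and the largest to 0.5.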
- remove_nans()
Removes rows from self.unified_gdf in which the ‘Average_Price’ column is NA and resets the index of self.unified_gdf. NB calling this method will change self.unified_gdf in place.
- Returns:
None
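The same operation on a plain pandas.DataFrame can be sketched as follows; a geopandas.GeoDataFrame supports the same calls. The data are hypothetical, and whether the method discards the old index (drop=True) is an assumption.

```python
import pandas as pd

# Hypothetical unified table with one missing price.
unified = pd.DataFrame(
    {"Station": ["A", "B", "C"], "Average_Price": [250000.0, None, 310000.0]}
)

# Keep only rows with a price, then renumber the index from zero.
unified = unified.dropna(subset=["Average_Price"]).reset_index(drop=True)
```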
- tidy_data(gev_dict)
Performs data munging on a gev_dict instance.
- Returns:
None
- class gev_utils.DataImportUtils.WebScraper(base_url)
Bases:
object
Scrapes met station data and parses data into a usable format.
- self.base_url
The base url to be scraped.
- Type:
str
- self.station_links
A dictionary where keys are station names and values are links to station data.
- Type:
dict
- self.station_dict
A dictionary where keys are station names and values are a list of data and metadata.
- Type:
dict
- self.nonstandard_list
A list of stations where the format of the metadata does not permit automatic parsing.
- Type:
list
- get_coordinates()
Attempts to extract coordinates from station metadata. If this fails, add the station name to self.nonstandard_list.
- Returns:
None
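The extract-or-flag pattern described above can be sketched with a regular expression. The metadata strings, the "Lat ... Lon ..." layout, the pattern and the helper name parse_coordinates are all hypothetical; the real metadata format is not shown in this reference.

```python
import re

# Hypothetical pattern for metadata of the form "Lat 57.006 Lon -3.396".
COORD_PATTERN = re.compile(r"Lat\s+(-?\d+(?:\.\d+)?)\s+Lon\s+(-?\d+(?:\.\d+)?)")


def parse_coordinates(station_metadata, nonstandard_list):
    """Return {station: (lat, lon)}; record stations that cannot be parsed."""
    coords = {}
    for name, text in station_metadata.items():
        match = COORD_PATTERN.search(text)
        if match is None:
            # Metadata does not follow the expected layout.
            nonstandard_list.append(name)
            continue
        coords[name] = (float(match.group(1)), float(match.group(2)))
    return coords


nonstandard = []
metadata = {
    "Station A": "Location: Lat 57.006 Lon -3.396, 339 metres amsl",
    "Station B": "Location details unavailable",
}
coords = parse_coordinates(metadata, nonstandard)
```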
- get_data()
Scrapes met station data (measurements) and metadata (coordinates, whether a station was relocated) from self.base_url. Modifies self.station_dict values to contain a list of the station’s link, a pandas.DataFrame of station measurements and a list of metadata.
- Returns:
A dict where keys are station names and values are lists of station data and metadata.
- Return type:
self.station_dict (dict)
- get_stations_and_links()
Scrapes base_url for met station names and met station links. Updates self.station_links and creates a dictionary where keys are met station names and values are met station links.
- Returns:
None
gev_utils.PlottingUtils module
- class gev_utils.PlottingUtils.Plotter(gb_outline_gdf, gev_dict, data_dict)
Bases:
object
Creates visualizations. In general, this class takes data and instances of matplotlib.figure.Figure and matplotlib.axes.Axes, creates a visualization, and returns the modified fig and ax(es) instances for further fine-tuning.
- self.gb_outline_gdf
An outline of the island of Great Britain.
- Type:
geopandas.GeoDataFrame
- self.gev_dict
The results of statistical modelling created by the accompanying Jupyter notebook.
- Type:
dict
- self.data_dict
The instance of data_dict created by DataImportUtils, containing data to be used in visualizations.
- Type:
dict
- get_difference(df, targ_col='Risk-adjusted house price', hdi_cols=['Lower_adj', 'Upper_adj'])
Helper method that returns the absolute differences between the values and their HDI bounds. matplotlib.axes.Axes.errorbar expects these differences rather than the bound values themselves.
- Parameters:
df (pandas.DataFrame) – A pandas.DataFrame of the risk-adjusted values generated by DataImportUtils.
- Kwargs:
- targ_col (str): A string of the name of the column
containing risk-adjusted values, default ‘Risk-adjusted house price’.
- hdi_cols (list): A list of length 2 containing the name of
the columns with the risk-adjusted lower and upper bounds, respectively. Default [‘Lower_adj’, ‘Upper_adj’].
- Returns:
- A pandas.DataFrame
containing data used to add error bars to plots.
- Return type:
error_bar_df (pandas.DataFrame)
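A sketch of turning bound values into error-bar offsets. The helper name error_bar_offsets and the output column names ‘minus’ and ‘plus’ are hypothetical. The final transpose reflects that matplotlib.axes.Axes.errorbar accepts asymmetric errors as an array of shape (2, N), with lower offsets in the first row and upper offsets in the second.

```python
import pandas as pd


def error_bar_offsets(
    df,
    targ_col="Risk-adjusted house price",
    hdi_cols=("Lower_adj", "Upper_adj"),
):
    """Return absolute distances from targ_col to each bound."""
    lower, upper = hdi_cols
    return pd.DataFrame(
        {
            "minus": (df[targ_col] - df[lower]).abs(),
            "plus": (df[upper] - df[targ_col]).abs(),
        }
    )


# Hypothetical single-station example.
df = pd.DataFrame(
    {
        "Risk-adjusted house price": [200000.0],
        "Lower_adj": [160000.0],
        "Upper_adj": [250000.0],
    }
)
error_bar_df = error_bar_offsets(df)

# Shape (2, N): suitable for the xerr or yerr argument of Axes.errorbar.
xerr = error_bar_df.to_numpy().T
```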
- plot_choropleth(geomunger, fig, axes)
Plots three choropleth maps in one figure: the first of Return intensity, the second of the risk variable (average house price), and the third of the risk-adjusted house price.
- Parameters:
geomunger (class.GeoMunger) – An instance of GeoMunger.
fig (matplotlib.figure.Figure) – An instance of matplotlib.figure.Figure.
axes (an array of matplotlib.axes.Axes) – The array of matplotlib.axes.Axes.
- Returns:
- The modified instance of
matplotlib.figure.Figure.
- axes (matplotlib.axes.Axes or array of Axes): The modified
array of Axes of matplotlib.axes.Axes instances.
- Return type:
fig (matplotlib.figure.Figure)
- plot_posterior(idata, ax)
Plots posterior predictive check.
- Parameters:
idata (arviz.InferenceData) – An instance of arviz.InferenceData.
ax (matplotlib.axes.Axes) – An instance of matplotlib.axes.Axes.
- Returns:
- The modified instance of
matplotlib.axes.Axes.
- Return type:
ax (matplotlib.axes.Axes)
- plot_prior_check(idata, ax, xlim=[0, 180], ylim=[0, 0.4])
Plots samples from the priors.
- Parameters:
idata (arviz.InferenceData) – An instance of arviz.InferenceData.
ax (matplotlib.axes.Axes) – An instance of matplotlib.axes.Axes.
- Kwargs:
- xlim (list): a list to be passed to matplotlib.pyplot.xlim,
default value [0, 180].
- ylim (list): a list to be passed to matplotlib.pyplot.ylim,
default value [0, 0.4].
- Returns:
- The modified instance of
matplotlib.axes.Axes.
- Return type:
ax (matplotlib.axes.Axes)
- plot_prior_post_prediction(station_name, fig, axes)
Plots the prior predictive, the posterior predictive and predictions. Styles plots using climatex.mplstyle (hard-coded).
- Parameters:
station_name (str) – A string of the station name key in gev_dict.
fig (matplotlib.figure.Figure) – An instance of matplotlib.figure.Figure.
axes (an array of matplotlib.axes.Axes) – The array of matplotlib.axes.Axes.
- Returns:
- The modified instance of
matplotlib.figure.Figure.
- axes (matplotlib.axes.Axes or array of Axes): The modified
array of Axes of matplotlib.axes.Axes instances.
- Return type:
fig (matplotlib.figure.Figure)
- plot_return_period(post_pred, return_periods, ax)
Plots a graph of the return intensity for each return period with 95% HDI.
- Parameters:
post_pred (arviz.InferenceData.posterior) – An instance of arviz.InferenceData.posterior.
return_periods (numpy.array) – The numpy array used to generate predictions for the posterior.
ax (matplotlib.axes.Axes) – An instance of matplotlib.axes.Axes.
- Returns:
- The modified instance of
matplotlib.axes.Axes.
- Return type:
ax (matplotlib.axes.Axes)
- plot_risk_adj_bar(df, fig, ax, index=None)
Plots risk-adjusted house prices and error bars, df is the result of GeoMunger.get_risk_adj_df().
- Parameters:
df (pandas.DataFrame) – A pandas.DataFrame containing risk-adjusted house prices and statistical predictions.
fig (matplotlib.figure.Figure) – An instance of matplotlib.figure.Figure.
ax (matplotlib.axes.Axes) – An instance of matplotlib.axes.Axes.
- Kwargs:
- index (pandas.DataFrame.index): An instance of
pandas.DataFrame.index to be used for the y-axis (NB this is a horizontal bar plot, so these are station names).
- Returns:
- The modified instance of
matplotlib.figure.Figure.
- axes (matplotlib.axes.Axes): The modified
matplotlib.axes.Axes instance.
- df.index (pandas.DataFrame.index): The
pandas.DataFrame.index used as labels for the plot; these are station names.
- Return type:
fig (matplotlib.figure.Figure)
- plot_ts(station_name, fig, axes)
Plots the time series with annual maxima highlighted and the probability density function estimated using Gaussian kernels.
- Parameters:
station_name (str) – A string of the met station name.
fig (matplotlib.figure.Figure) – An instance of matplotlib.figure.Figure.
axes (matplotlib.axes.Axes or array of Axes) – An instance of matplotlib.axes.Axes.
- Returns:
- The modified instance of
matplotlib.figure.Figure.
- axes (matplotlib.axes.Axes or array of Axes): The modified
array of Axes of matplotlib.axes.Axes instances.
- Return type:
fig (matplotlib.figure.Figure)