Rules of Aggregation • mscatch

After pulling the data (the length sample data and the landings data) for a particular species, a user can use the aggregate_landings() function to aggregate the landings and the length data. The user has several choices to make that determine how the data are processed. All decision points are written to a logfile for the user to inspect. Many of the steps below are controlled by user inputs in the form of function arguments.

Gear

Gear is defined using the field NEGEAR, which characterizes gear by a three-character code. A list of gear types with their codes and descriptions can be found using comlandr::get_gears().

This step ignores time (YEAR, QTR), MARKET_CODE, and length distributions and looks at aggregated landings by gear type to determine the dominant gear types in the fishery. The number of distinct gear types selected is controlled by a user-defined parameter giving the total percentage of landings to be accounted for by distinct gear types. (The remaining gears are combined into an otherGear category.)

For example, a value of 95% (landingsThresholdGear = .95) selects gear types in descending order of landings until their cumulative share of total landings first exceeds 95%. All remaining gear types (together comprising < 5% of total landings) are combined into an otherGear category.

In the table below, the gear types (050, 010, 100) would be retained. All other gear types would be combined into otherGear.

NEGEAR	Landings	Cumulative landings	Cumulative proportion
050	1141407667	1141407667	0.8285119
010	95196644	1236604311	0.8976121
100	93069581	1329673892	0.9651684
020	30155056	1359828948	0.9870570
056	9405172	1369234120	0.9938839
132	1773276	1371007396	0.9951711
057	1630702	1372638098	0.9963548
160	1462198	1374100296	0.9974161
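The selection rule illustrated by the table can be sketched in a few lines. This is an illustrative Python sketch, not the package's R implementation; the function name select_dominant_gears is hypothetical, and the rule it encodes (retain gears until the cumulative share first exceeds the threshold) is one reading that is consistent with the table. Only the eight rows shown are used, so the total, and therefore the proportions, differ slightly from those in the table.

```python
def select_dominant_gears(landings_by_gear, threshold=0.95):
    """Split gear codes into (retained, other) by cumulative landings share.

    Hypothetical helper: gears are ranked by landings and retained until
    the cumulative share of total landings first exceeds `threshold`.
    """
    total = sum(landings_by_gear.values())
    # Rank gear codes from largest to smallest landings.
    ranked = sorted(landings_by_gear.items(), key=lambda kv: kv[1], reverse=True)
    retained, other = [], []
    cumulative = 0
    for gear, landings in ranked:
        if cumulative > threshold * total:
            # The retained gears already exceed the threshold.
            other.append(gear)
        else:
            retained.append(gear)
            cumulative += landings
    return retained, other


# Landings by NEGEAR, copied from the rows of the table above.
landings = {
    "050": 1141407667,
    "010": 95196644,
    "100": 93069581,
    "020": 30155056,
    "056": 9405172,
    "132": 1773276,
    "057": 1630702,
    "160": 1462198,
}

retained, other = select_dominant_gears(landings, threshold=0.95)
print("retained:", retained)
print("otherGear:", other)
```

With these rows, the first three gear codes are retained and the remaining five fall into the otherGear category.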

The length distributions (aggregated over time, MARKET_CODE) are then tested against each other (using the Kolmogorov-Smirnov test) at a predetermined significance level (pValue=0.05) to determine if they are significantly different. Any NEGEAR codes whose length distributions are not found to be significantly different are aggregated.
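The pairwise comparison can be illustrated with a two-sample Kolmogorov-Smirnov test written in plain Python. This is a sketch under several assumptions, not the package's R code: it uses the large-sample critical value rather than an exact p-value, it treats each length as a single observation (real length samples are typically frequency-weighted), and it merges codes transitively, which the text above does not specify. All function names and the length data are hypothetical.

```python
import math
from bisect import bisect_right
from itertools import combinations


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two ECDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # The ECDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def significantly_different(a, b, alpha=0.05):
    """Large-sample test: reject equality when D exceeds the critical value."""
    n, m = len(a), len(b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))  # about 1.358 for alpha = 0.05
    return ks_statistic(a, b) > c_alpha * math.sqrt((n + m) / (n * m))


def group_similar(samples, alpha=0.05):
    """Merge codes whose length distributions are not significantly different.

    Assumption: merging is transitive (if A~B and B~C, all three are grouped).
    """
    parent = {code: code for code in samples}

    def find(code):
        while parent[code] != code:
            code = parent[code]
        return code

    for c1, c2 in combinations(samples, 2):
        if not significantly_different(samples[c1], samples[c2], alpha):
            parent[find(c2)] = find(c1)

    groups = {}
    for code in samples:
        groups.setdefault(find(code), []).append(code)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical lengths (cm) for three gear codes.
lengths = {
    "050": list(range(20, 60)),
    "010": list(range(21, 61)),   # nearly identical to 050
    "100": list(range(60, 100)),  # clearly larger fish
}
groups = group_similar(lengths)
print(groups)
```

Here codes 050 and 010 end up in one group while 100 stays separate. The same comparison applies to MARKET_CODEs in the next section.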

If a species_object is used, the steps above are skipped since all gear aggregations are predetermined by the user.

Market Category

Market categories are defined by the field MARKET_CODE, which identifies each market category by a two-character code. A list of market codes for a species can be found using comlandr::get_species_itis().

Any MARKET_CODEs that do not have any length samples (when aggregated over YEAR, QTR) are treated first. The user has the option to aggregate these market codes with any other market code. The default option is to relabel these as unknowns, “UN”. Often these MARKET_CODEs have a small amount of landings attributed to them and the impact of this aggregation is negligible.
The length distributions for the remaining MARKET_CODEs (aggregated over YEAR, QTR) are then tested against each other (using the Kolmogorov-Smirnov test) at a predetermined significance level (pValue=0.05) to determine if they are significantly different. Any MARKET_CODEs whose length distributions are not found to be significantly different are aggregated.
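The first step, relabeling market codes that have no length samples, amounts to a simple merge of landings. The Python sketch below is illustrative only and is not the package's R implementation; the function name relabel_unsampled, the market codes other than UN, and the landings values are hypothetical.

```python
def relabel_unsampled(landings_by_code, sampled_codes, target="UN"):
    """Fold landings of market codes without length samples into `target`."""
    merged = {}
    for code, landings in landings_by_code.items():
        # Codes with samples keep their label; all others are relabeled.
        key = code if code in sampled_codes else target
        merged[key] = merged.get(key, 0) + landings
    return merged


# Hypothetical landings by MARKET_CODE; only LG and MD have length samples.
landings_by_code = {"LG": 500, "MD": 300, "SQ": 20, "UN": 80}
merged = relabel_unsampled(landings_by_code, sampled_codes={"LG", "MD"})
print(merged)
```

Passing a different target would correspond to the option of aggregating the unsampled codes with another market code instead of UN.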

If a species_object is used, the steps above are skipped since all market code aggregations are predetermined by the user.

The user has the option of stopping at this point (borrowLengths=F). The data returned will either be at the QTR or YEAR level depending on user inputs.

Time

The user supplies the level at which landings should be aggregated (YEAR, QTR, or SEMESTER). All combinations of time, NEGEAR, and MARKET_CODE that do not have associated length samples borrow samples from the nearest neighbor in time. (Future development: include nearest neighbor based on spatial units and/or gear.)

Before the borrowing of length samples commences, any gear types found to not have any length samples are aggregated with the otherGear category. (This does not occur if a species_object is used.)

The borrowing of length samples from one time interval to use in another time interval can be a subjective process that differs among species based on life history traits. To complicate matters, length distributions sampled within MARKET_CODEs may have shifted over time. These issues will be dealt with at a future date. A generalized method is currently applied. (Future development: include additional options based on life history traits and fishing industry changes.)

Aggregate landings to QTRs and borrow length samples where necessary ¹
Aggregate landings to SEMESTERs and borrow length samples where necessary
Aggregate landings to YEARs and borrow length samples where necessary
Aggregate landings in each year to YEAR, SEMESTER, or QTR depending on the availability of length samples in close proximity to the target year (Future development)

¹ The method of borrowing length samples is complex and the user has several options. These methods are described below.

Aggregate to YEAR

The following are performed sequentially:

The landings and length sample data are aggregated over QTR to the annual (YEAR) level.
For each gear type (NEGEAR) the Kolmogorov-Smirnov test is applied in each YEAR (Future work) to test for differences in the length distributions among MARKET_CODEs. Any MARKET_CODEs whose length distributions are not found to be significantly different can be aggregated by the user. The default is no aggregation of MARKET_CODEs.
For each gear type (NEGEAR) and market code: A YEAR with missing length samples borrows length samples from the YEAR closest to it in time (in the future or the past)
For any NEGEAR/MARKET_CODE combinations where the number of length samples is less than the user-defined value (nLengthSamples), for all YEARs, length samples are borrowed from another NEGEAR type with the same MARKET_CODE. Samples are borrowed from the NEGEAR with the most landings (since it is assumed to have the most length samples). If that NEGEAR has insufficient length samples, the next NEGEAR in order of total landings is checked, and so on.
The sampling program for most species started several years after landings were first recorded, so there is a stretch of consecutive years in the early part of the time series without any length samples. These years are assigned length samples from the first year in which samples were taken.
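The two borrowing rules in the steps above (nearest YEAR in time, then another NEGEAR ranked by landings) can be sketched as follows. This is an illustrative Python sketch, not the package's R implementation. The function names are hypothetical, ties between an earlier and a later year are broken toward the earlier year (an assumption), and "fewer than nLengthSamples for all YEARs" is read here as a total over all years (also an assumption).

```python
def nearest_sampled_year(year, sampled_years):
    """Closest year (past or future) that has length samples, or None."""
    if not sampled_years:
        return None
    # Assumption: ties are broken toward the earlier year.
    return min(sampled_years, key=lambda y: (abs(y - year), y))


def assign_donor_years(years, n_samples_by_year, min_samples=1):
    """Map every YEAR to the YEAR whose length samples it uses."""
    sampled = sorted(y for y in years if n_samples_by_year.get(y, 0) >= min_samples)
    return {y: nearest_sampled_year(y, sampled) for y in years}


def choose_donor_gear(target_gear, landings_by_gear, n_samples_by_gear, n_length_samples):
    """Pick another NEGEAR (same MARKET_CODE) with enough samples.

    Gears are checked in descending order of total landings.
    """
    ranked = sorted(landings_by_gear, key=landings_by_gear.get, reverse=True)
    for gear in ranked:
        if gear != target_gear and n_samples_by_gear.get(gear, 0) >= n_length_samples:
            return gear
    return None


# Hypothetical series: sampling starts in 1993 and is missing in some years.
years = list(range(1990, 1997))
donors = assign_donor_years(years, {1993: 12, 1995: 30})
print(donors)

# Hypothetical gears: 100 has too few samples, 050 has the most landings
# but also too few samples, so 010 is chosen.
donor_gear = choose_donor_gear(
    "100",
    landings_by_gear={"050": 1000, "010": 400, "100": 300},
    n_samples_by_gear={"050": 3, "010": 25, "100": 2},
    n_length_samples=10,
)
print(donor_gear)
```

Because the nearest sampled year to any early unsampled year is the first sampled year, the rule for the start of the time series falls out of the nearest-neighbor rule in this sketch.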

Aggregate to QTR

For YEARs where there are landings before any length samples were taken, all YEAR/QTRs are assigned the length samples from the first YEAR in which length samples were taken (from the same QTR).
For YEARs when QTRs have missing length samples, the previous year's length samples for the same QTR are assigned. If there are no length samples in the previous year, the process is repeated, stepping back one year at a time, until a length sample is available. If this still results in no length samples, the nearest neighbor is used: length samples from the nearest QTR (any QTR) are assigned. If this still results in no length samples, the nearest neighbor is taken from the NEGEAR type that is attributed with the majority of the landings.
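The fallback order for quarterly borrowing can be sketched as a small cascade. This Python sketch is one interpretation of the rules above, not the package's R implementation. The function name is hypothetical, distance between quarters is measured in number of quarters (an assumption), and the final fallback to the dominant NEGEAR is left to the caller by returning None.

```python
def donor_quarter(year, qtr, sampled):
    """Return the (YEAR, QTR) whose length samples are used for (year, qtr).

    `sampled` is the set of (YEAR, QTR) pairs that have length samples for
    one NEGEAR/MARKET_CODE combination.
    """
    if (year, qtr) in sampled:
        return (year, qtr)

    same_qtr_years = sorted(y for (y, q) in sampled if q == qtr)

    # 1. Step back through previous years for the same QTR.
    earlier = [y for y in same_qtr_years if y < year]
    if earlier:
        return (earlier[-1], qtr)

    # 2. Before sampling began: first sampled year for the same QTR.
    if same_qtr_years:
        return (same_qtr_years[0], qtr)

    # 3. Nearest neighbor in time, any QTR (distance counted in quarters).
    if sampled:
        t = 4 * year + (qtr - 1)
        return min(sampled, key=lambda yq: (abs(4 * yq[0] + (yq[1] - 1) - t), yq))

    # 4. Nothing available: the caller falls back to the dominant NEGEAR.
    return None


# Hypothetical set of sampled (YEAR, QTR) pairs.
sampled = {(1995, 1), (1995, 2), (1996, 1), (1998, 3)}
print(donor_quarter(1997, 1, sampled))  # previous year, same QTR
print(donor_quarter(1993, 2, sampled))  # before sampling began
print(donor_quarter(1997, 4, sampled))  # no QTR 4 samples at all
```

Returning the donor (YEAR, QTR) rather than the samples themselves makes it straightforward to record each borrowing decision.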

Aggregate to SEMESTER

Landings and length data are relabeled from QTR to SEMESTER (SEMESTER 1 = QTR 1 and 2, SEMESTER 2 = QTR 3 and 4). The procedure for borrowing length samples is the same as for QTRs.
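The relabeling is a fixed mapping from QTR to SEMESTER followed by a sum. A minimal Python sketch, with hypothetical function names and landings values, and not the package's R implementation:

```python
def qtr_to_semester(qtr):
    """SEMESTER 1 = QTR 1 and 2, SEMESTER 2 = QTR 3 and 4."""
    if qtr not in (1, 2, 3, 4):
        raise ValueError(f"invalid QTR: {qtr}")
    return 1 if qtr <= 2 else 2


def aggregate_to_semester(landings_by_year_qtr):
    """Sum landings keyed by (YEAR, QTR) into (YEAR, SEMESTER) totals."""
    totals = {}
    for (year, qtr), landings in landings_by_year_qtr.items():
        key = (year, qtr_to_semester(qtr))
        totals[key] = totals.get(key, 0) + landings
    return totals


# Hypothetical quarterly landings for one NEGEAR/MARKET_CODE combination.
quarterly = {(2001, 1): 10, (2001, 2): 15, (2001, 3): 7, (2001, 4): 3, (2002, 1): 12}
semesters = aggregate_to_semester(quarterly)
print(semesters)
```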

All decisions made regarding where length samples are borrowed from are written to the logfile for later inspection.

Other Gear

In all methods above the otherGear category is by definition sparse. This gear type is always aggregated to annual data.

Unclassified Market Code

In all methods above the landings and length sample data for the unclassified market category, UN, are aggregated to the same level as other MARKET_CODEs.

The unclassified landings for a specific (YEAR, QTR, NEGEAR) are assumed to be a mixture of fish lengths similar to that in the observed MARKET_CODEs. Therefore the length distribution for the unclassified landings is assumed to be the same as the length distribution of the landed fish in all of the known market codes combined.

For each YEAR, QTR, and NEGEAR combination where no length samples are available for the unclassified landings, the length samples from all other MARKET_CODEs combined are assigned to the unclassified category.
For cases where there are also no samples for these other MARKET_CODEs, the nearest neighbor in time is used.
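The pooling rule for the unclassified category can be sketched as follows. This is an illustrative Python sketch, not the package's R implementation; the function name, the market codes other than UN, and the length values are hypothetical. An empty result signals that the nearest neighbor in time would be needed.

```python
def unclassified_lengths(samples, year, qtr, gear, unclassified="UN"):
    """Length samples to use for unclassified landings in one stratum.

    `samples` maps (YEAR, QTR, NEGEAR, MARKET_CODE) to a list of lengths.
    """
    own = samples.get((year, qtr, gear, unclassified), [])
    if own:
        return list(own)
    # No UN samples: pool the samples of all known market codes
    # in the same (YEAR, QTR, NEGEAR).
    pooled = []
    for (y, q, g, code), lengths in sorted(samples.items()):
        if (y, q, g) == (year, qtr, gear) and code != unclassified:
            pooled.extend(lengths)
    return pooled


# Hypothetical length samples (cm).
samples = {
    (2000, 1, "050", "LG"): [40, 42],
    (2000, 1, "050", "MD"): [30, 31, 33],
    (2000, 1, "050", "UN"): [],
    (2000, 2, "050", "UN"): [35],
    (2000, 1, "010", "LG"): [50],
}
print(unclassified_lengths(samples, 2000, 1, "050"))  # pooled from LG and MD
print(unclassified_lengths(samples, 2000, 2, "050"))  # UN has its own samples
print(unclassified_lengths(samples, 2001, 1, "050"))  # empty: borrow in time
```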