Data Confidentiality

The Code of Practice for Statistics (CoP), specifically T6: Data governance, and the Government Statistical Service (GSS) guidance on anonymisation and data confidentiality set out the principles for how we protect data about individuals from disclosure.

Introduced Random Error

Introduced random error is used with Stat-Xplore to ensure that no data are released which could risk the identification of individuals in the statistics.

Many classifications used within Stat-Xplore have an uneven distribution of data throughout their categories, in particular across geographical areas. When geographical area is cross-tabulated with other breakdowns, such as age / gender / family type, the number in the table cell could be small. These small numbers increase the risk of identifying individuals in the statistics.

Even when variables are more evenly distributed in the classifications, the problem still occurs. The more detailed the classifications, and the more of them that are applied in constructing a table, the greater the incidence of very small cells.
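
The arithmetic behind this can be sketched as follows. This is a minimal illustration only: the population size and the numbers of categories are invented for the example and are not Stat-Xplore figures. It shows how the average number of individuals per cell falls as more classifications are cross-tabulated.

```python
from math import prod

def average_cell_size(population: int, category_counts: list[int]) -> float:
    """Average count per cell when every classification is cross-tabulated."""
    n_cells = prod(category_counts)  # number of cells in the full table
    return population / n_cells

# Hypothetical figures: 1,000,000 individuals tabulated by 400 areas,
# then also by 20 age bands, 2 genders and 10 family types.
by_area = average_cell_size(1_000_000, [400])
by_all = average_cell_size(1_000_000, [400, 20, 2, 10])
print(by_area)  # 2500.0
print(by_all)   # 6.25
```

Because real data are unevenly distributed, many cells in such a table would fall well below the average.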

Care is taken in the specification of tables to minimise the risk of identifying individuals. In addition, a technique has been developed to randomly adjust cell values. Random adjustment of the data is considered to be the most satisfactory technique for avoiding the release of identifiable data. When the technique is applied, all cells may be slightly adjusted to prevent any identifiable data being exposed. These adjustments result in small introduced random errors. However, the information value of the table as a whole is not impaired. The technique allows very large tables, for which there is a strong customer demand, to be produced even though they contain numbers of very small cells.
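
The general idea of random adjustment can be sketched as below. The details of the technique used with Stat-Xplore are not given here, so everything in this example is an assumption made for illustration: the function name perturb_cells, the uniform adjustment of up to 2 in either direction, the clipping at zero and the sample table are all hypothetical.

```python
import random

def perturb_cells(cells: dict[str, int], max_shift: int = 2, seed: int = 0) -> dict[str, int]:
    """Return a copy of `cells` with a small random adjustment on each count.

    Assumption: each adjustment is drawn uniformly from -max_shift..+max_shift
    and results are clipped at zero. This is an illustration, not the
    method actually used with Stat-Xplore.
    """
    rng = random.Random(seed)  # seeded so the same input gives the same output
    adjusted = {}
    for key, count in cells.items():
        shift = rng.randint(-max_shift, max_shift)
        adjusted[key] = max(0, count + shift)  # counts can never go negative
    return adjusted

# Hypothetical table: counts by area and age band.
table = {"Area A, 16-24": 3, "Area A, 25-34": 1250, "Area B, 16-24": 0}
released = perturb_cells(table)
```

In this sketch the large cell changes by a negligible proportion, while the small cells may change by a large proportion of their value, which is why small cells should not be relied upon.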

It is not possible to determine which individual figures have been affected by random error adjustments, but the small variance which may be associated with derived totals can, for the most part, be ignored.

No reliance should be placed on small cells, as they are affected by random adjustment as well as by respondent and processing errors.

Many different classifications are used in Stat-Xplore tables and the tables are produced for a variety of geographical areas. The effect of the introduced random error is minimised if the required statistic is taken directly from a tabulation rather than calculated by aggregating more finely classified data. Similarly, published data for a larger standard geographic area should be used wherever possible, rather than aggregating data from small areas to obtain statistics about that larger area.
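
The reason aggregation increases the effect can be shown with a short calculation. This is a sketch under stated assumptions, not a description of the actual method: it assumes each cell receives an independent adjustment drawn uniformly from the integers -2 to +2, and the function name adjustment_std is hypothetical. Under those assumptions the spread of the summed adjustment grows with the square root of the number of cells added together.

```python
from math import sqrt

def adjustment_std(n_cells: int, max_shift: int = 2) -> float:
    """Standard deviation of the summed adjustment over n_cells cells.

    Assumption: each cell gets an independent adjustment drawn uniformly
    from the integers -max_shift..+max_shift, which has variance
    max_shift * (max_shift + 1) / 3.
    """
    per_cell_variance = max_shift * (max_shift + 1) / 3
    return sqrt(n_cells * per_cell_variance)

direct = adjustment_std(1)        # total read from a single published cell
aggregated = adjustment_std(100)  # total built by summing 100 small-area cells
print(round(direct, 2))      # 1.41
print(round(aggregated, 2))  # 14.14
```

In this illustration, a total built from 100 small-area cells carries ten times the spread of the same total read from a single published cell.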

When calculating proportions, percentages or ratios from cross-classified or small area tables, the random error introduced can be ignored except when very small cells are involved, in which case the impact on percentages and ratios can be significant.
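
A worked example makes the difference concrete. The sketch below assumes, purely for illustration, that a cell may be adjusted by up to 2 in either direction and that the total used as the denominator is exact. The function name percentage_range and the counts are hypothetical.

```python
def percentage_range(count: int, total: int, max_shift: int = 2) -> tuple[float, float]:
    """Lowest and highest percentage if `count` is off by up to max_shift.

    Assumption: only the numerator is adjusted; the total is treated as exact.
    """
    low = max(0, count - max_shift)
    high = min(total, count + max_shift)
    return 100 * low / total, 100 * high / total

small = percentage_range(3, 10)        # very small cell
large = percentage_range(3000, 10000)  # large cell, same nominal 30%
print(small)  # (10.0, 50.0)
```

Both cells represent a nominal 30%, but under these assumptions the small cell could correspond to anything from 10% to 50%, while the large cell stays within roughly 29.98% to 30.02%.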

As a result of improvement activities on DWP's data platforms, the methodology that generates the seeding used to randomly apply statistical disclosure control is being updated. This may result in small differences in the introduced random error applied to prevent disclosure. The update is being applied incrementally, from November 2023 onward, across the Stat-Xplore products to which disclosure control is applied. There is no material impact for users.

The introduced random error method applied to the data from August 2024 will have a small impact on previously published historical figures across Stat-Xplore releases. This update was necessary to enable a move to a more advanced data processing platform while maintaining data confidentiality.
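
Why a change in seeding alters figures only slightly can be sketched as follows. The actual seeding methodology is not described here, so this example is entirely hypothetical: it assumes an adjustment of up to 2 in either direction derived from a hash of a cell key and a seeding version label, and the names seeded_adjustment, "seeding-v1" and "seeding-v2" are invented for illustration.

```python
import hashlib

def seeded_adjustment(cell_key: str, seeding_version: str, max_shift: int = 2) -> int:
    """Derive a repeatable adjustment in -max_shift..+max_shift for one cell.

    Assumption: the seed is a hash of the seeding version and the cell key.
    This is an illustration, not the seeding method DWP uses.
    """
    digest = hashlib.sha256(f"{seeding_version}|{cell_key}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return value % (2 * max_shift + 1) - max_shift

# Hypothetical cell keys and version labels.
keys = [f"Area {i}, 16-24" for i in range(50)]
old = [seeded_adjustment(k, "seeding-v1") for k in keys]
new = [seeded_adjustment(k, "seeding-v2") for k in keys]
largest_change = max(abs(a - b) for a, b in zip(old, new))
```

In this sketch the same seeding version always reproduces the same adjustment for a given cell. Changing the version can change individual adjustments, but no published figure can move by more than twice the maximum adjustment.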

Surveys

Please note that the Households Below Average Income (HBAI), Pensioner Incomes (PI) and Family Resources Survey (FRS) datasets do not use introduced random error because they are survey-based. The underlying sample of individuals in each survey has been 'grossed' so that each sampled individual represents a group of individuals with those characteristics in the whole population. In addition to this grossing, only broad categorical information is used, so an individual person cannot be identified in the data.
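
Grossing can be illustrated with a minimal sketch. The respondents, categories and weights below are invented for the example and are not taken from any of these surveys, and the function name grossed_estimate is hypothetical. Each weight is the number of people in the population that one respondent stands for, so a published figure is a sum of weights rather than a count of identifiable individuals.

```python
def grossed_estimate(sample: list[dict], category: str) -> float:
    """Estimate the population count in `category` by summing grossing weights."""
    return sum(r["weight"] for r in sample if r["category"] == category)

# Hypothetical respondents with hypothetical grossing weights.
sample = [
    {"category": "pensioner couple", "weight": 1200.0},
    {"category": "pensioner couple", "weight": 950.0},
    {"category": "single pensioner", "weight": 1430.0},
]
estimate = grossed_estimate(sample, "pensioner couple")
print(estimate)  # 2150.0
```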