Kingwood Data Privacy FAQ

by John D. Cook, PhD

1. General data privacy questions

1.1. What’s wrong with the nothing-to-hide argument?

Why should anyone care about privacy, especially people who feel they have nothing to hide?

One problem with this argument is that it assumes there are no false positives: that strangers who make inferences about you based on your personal data – law enforcement, insurance companies, employers, etc. – are never wrong in their conclusions, or that if they are wrong, mistakes will be quickly and effortlessly resolved. Experience shows that every inference has some error rate, and that resolving errors can be painful or impossible. More on this here.

1.2. Does removing names make data deidentified?

Not at all. Removing names does little good if the data still contain the information needed to infer names. It can be surprisingly easy to identify people in data which has had obvious identifiers removed.

2. HIPAA, Expert Determination, and Safe Harbor

2.1. Is there more to Safe Harbor than 18 rules?

The Safe Harbor provision of the HIPAA Privacy Rule does indeed list 18 kinds of data to remove in order to deidentify data. While most of the rules are fairly objective, the 18th rule says to remove “any other unique identifying number, characteristic, or code.” How do you know whether a characteristic is identifying? There is also the so-called 19th rule, which is likewise open-ended.

2.2. Does Safe Harbor really protect privacy?

It may or may not. As noted above, Safe Harbor is a little fuzzy. One could even argue after-the-fact that if privacy wasn’t protected, something must have gone wrong with the 18th or 19th rules. But there have been data sets which complied with the objective portions of the Safe Harbor provisions and yet which allowed individuals to be identified.

Any method of deidentifying data leaves some risk that an individual in the data may be identified, but ideally this risk should be very small. Whether the risk is indeed very small depends on more context than is present in the Safe Harbor rules.

2.3. Why does Safe Harbor remove dates of service?

In a nutshell, dates of service can sometimes be cross-referenced with publicly available data in order to identify individuals. More details are available here.

2.4. What is a “covered entity” under HIPAA?

According to the HIPAA regulations (45 CFR 160.103), “Covered entity means: (1) A health plan. (2) A health care clearinghouse. (3) A health care provider who transmits any health information in electronic form in connection with a transaction covered by this subchapter.”

More on who is and isn’t a covered entity is available from the U.S. Department of Health and Human Services.

2.5. What is a business associate?

According to HHS, “A ‘business associate’ is a person or entity that performs certain functions or activities that involve the use or disclosure of protected health information on behalf of, or provides services to, a covered entity.”

3. Artificial intelligence and natural language processing

3.1. How effective is software at removing PII from free text?

It varies a great deal. We have evaluated software that does an excellent job of removing personally identifiable information. We have also evaluated software that flagrantly failed to do so.

See this article for more information, including some reasons why NLP software may not perform as well in practice as it is assumed to perform in theory.

3.2. Can large language models leak personal information?

This has happened in practice. This article gives a hypothetical example of how this might happen.

4. Cryptography

4.1. Does hashing attributes protect privacy?

Hashing may indeed protect private information, but it can also fail in subtle ways.

Cryptographic hash algorithms make it impractical to infer the input of the algorithm from its output, provided we know nothing about the input. But if the data come from a small set of possible values, and if the hashing algorithm is known, then it is possible to hash all possible values, making what is known as a “rainbow table.”

For example, there are only 50 US states. If US state of residence is hashed and the hash value used as a database field, anyone could hash the 50 state names and see which one corresponds to each hashed value.
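
To make this concrete, here is a minimal sketch of such a rainbow-table lookup, assuming the states were hashed with plain, unkeyed SHA-256. The choice of SHA-256 is an assumption for illustration; any known, unkeyed hash has the same weakness.

    import hashlib

    # Hypothetical: only a few states listed here; in practice you would hash all 50.
    states = ["Alabama", "Alaska", "Arizona", "California", "Texas"]

    # The "rainbow table": precompute hash -> plaintext for every possible input.
    table = {hashlib.sha256(s.encode()).hexdigest(): s for s in states}

    # Any hashed state appearing in the database can now be reversed by lookup.
    observed = hashlib.sha256("California".encode()).hexdigest()
    print(table[observed])  # prints "California"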

Even for fields with many more possible values, such as phone numbers, it is feasible to create a rainbow table by exhaustively hashing the values. However, if you concatenate several attributes together before hashing, the universe of possible inputs may be too large to exhaustively hash.

One way to make a rainbow table attack less feasible is to use a key with the hash so that in effect the hashing algorithm is unknown. (The hashing algorithm could be a standard algorithm like SHA-256, but if the data is XOR’d with a private key before hashing, then in a sense the hashing algorithm is unknown.) Another way to thwart rainbow table attacks is to use an algorithm designed to be time-consuming or memory-consuming, such as Argon2.
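
As a sketch of the first approach, one could use a keyed hash such as HMAC-SHA256 rather than the XOR construction described above; this is an illustrative assumption, not a recommendation of a particular scheme. Without the key, an attacker cannot precompute a rainbow table.

    import hashlib
    import hmac
    import secrets

    key = secrets.token_bytes(32)  # the private key; must be kept secret

    def keyed_hash(value: str) -> str:
        # HMAC-SHA256 of the value under the private key
        return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

    # An attacker who does not know the key cannot hash the 50 states in advance.
    print(keyed_hash("California"))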

Sometimes it is possible to infer the input to a hash function a posteriori even if it is not possible to pre-compute the hash values. For example, if someone hashed US states of residence, using an expensive hash function with a private key, one could still infer, for example, that the hash value that appears most frequently in the data is likely to be California since that is the most populous state.

4.2. Does metadata pose a privacy risk if the content is encrypted?

Suppose you know that someone called a drug abuse hotline one night, and called several drug rehabilitation facilities the next day. You know what phone numbers they called and how long each call lasted. But the content of the call was encrypted. What do you suppose the phone calls were about?

5. State privacy laws

5.1. How do US states extend HIPAA?

The US federal government basically defines a “covered entity” as a health care provider, a health plan, or a health care clearinghouse. But the state of Texas extends the definition to include any business “assembling, collecting, analyzing, using, evaluating, storing, or transmitting protected health information” in the Texas Medical Records Privacy Act.

There are many state privacy laws, the best known being the California Consumer Privacy Act (CCPA). Other state laws to be aware of include:

  • California Privacy Rights Act
  • Colorado Privacy Act
  • Connecticut Personal Data Privacy and Online Monitoring Act
  • Delaware Personal Data Privacy Act
  • Indiana Consumer Data Protection Act
  • Iowa Consumer Data Protection Act
  • Montana Consumer Data Privacy Act
  • Oregon Consumer Privacy Act
  • Tennessee Information Protection Act
  • Texas Data Privacy and Security Act
  • Utah Consumer Privacy Act
  • Virginia Consumer Data Protection Act
  • Washington Biometric Privacy Law

5.2. Is there such a thing as expert determination for CCPA?

California’s CCPA makes references to HIPAA, as do other state laws. In particular, California’s AB 713 refers to “The deidentification methodology described in Section 164.514(b)(1) of Title 45 of the Code of Federal Regulations, commonly known as the HIPAA expert determination method.”

Ask your legal counsel how state laws are relevant to your business.

6. GDPR

6.1. What is the right to be forgotten?

The right to be forgotten means that someone may ask to have their information removed from certain data repositories. This sounds simple, but it is fraught with technical difficulties if not logical contradictions. A business may be required by one law to remove data and required by another law to retain data.

You can read more about the right to be forgotten as illustrated by the fictional George Bailey and the real Barbra Streisand.

6.2. What is Pseudonymization?

Many people have defined pseudonymization in varying ways, but the clearest definition may be reversible de-identification. That is, identifiers have been replaced with substitutes, and someone knows how to recover the original data from these replacements. The data have not been permanently anonymized. That’s one way the term is used, but unfortunately there are others. See this page for a discussion.

6.3. What is anonymization?

Anonymization and de-identification mean essentially the same thing. They may mean exactly the same thing, or there may be subtle differences, depending on whose definition you’re going by.

7. Data breaches and incidents

7.1. What is a computer security incident?

According to the National Institute of Standards and Technology (NIST), “a computer security incident is a violation or imminent threat of violation of computer security policies, acceptable use policies, or standard security practices.”

7.2. What is a privacy incident?

According to the US Centers for Medicare & Medicaid Services, “A privacy incident is any event that has resulted in (or could result in) unauthorized use or disclosure of PII/PHI where persons other than authorized users have access (or potential access) to PII/PHI, or use it for an unauthorized purpose.”

7.3. What is a breach?

According to the HHS, “A breach is, generally, an impermissible use or disclosure under the Privacy Rule that compromises the security or privacy of the protected health information.” However, the HHS goes on to say “There are three exceptions to the definition of ‘breach.’ …”

For the precise legal meaning of breach, and to determine whether your company has suffered a breach according to the legal definition, please consult your attorney.

7.4. What is required of a covered entity after a breach?

The HHS discusses breach notification requirements under HIPAA here. Note that other laws and regulations may apply besides HIPAA. Consult your attorney for details and advice.

We can help you assess the privacy implications of a data incident or breach, working with your legal team to determine how to proceed. This involves evaluating whether the data could be considered deidentified and giving you an idea of whether or how an attacker could use the data. We can also advise you on how to prevent privacy breaches in the future.

7.5. How common are data breaches?

There are thousands of data breaches a year, exposing data on hundreds of millions of people.

One of the largest breaches to date was the 2021 Facebook breach, which leaked data on roughly half a billion Facebook users.

8. Data privacy techniques

8.1. What is randomized response?

If someone, or software on their behalf, adds some randomness to their data, the aggregate data from many users might still be quite useful, even if individual responses are not accurate. This can be used, for example, to give survey respondents plausible deniability. More on randomized response here.
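
Here is a minimal sketch of the classic coin-flip version of randomized response. The 30% prevalence and the sample size are made-up numbers for illustration.

    import random

    def randomized_response(truth: bool) -> bool:
        if random.random() < 0.5:       # first coin flip: answer truthfully
            return truth
        return random.random() < 0.5    # second flip: answer yes or no at random

    def estimate_proportion(responses):
        # Observed "yes" rate p satisfies p = 0.5 * t + 0.25, so t = 2p - 0.5.
        p = sum(responses) / len(responses)
        return 2 * p - 0.5

    # 10,000 respondents, 30% of whom have the sensitive attribute.
    truths = [random.random() < 0.3 for _ in range(10_000)]
    responses = [randomized_response(t) for t in truths]
    print(estimate_proportion(responses))  # roughly 0.3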

8.2. What is synthetic data?

A synthetic data set is a set of fictional data, generated to have the same characteristics as a set of real data on which it is based.

Creating a synthetic data set may provide excellent privacy protection: no individual’s data is being revealed. But if the process of creating the synthetic data set is flawed, data on some users may slip through the process relatively unchanged.

There’s a subtle problem of deciding what features of the original data set to retain. A model is fit to the original data, then new data is generated using that model. Did the model retain the features important for your purposes? It’s hard to say. The point of creating a synthetic data set is that the recipient’s use of the data is unknown. If the use was known, someone could run that analysis rather than creating a synthetic data set.
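
Here is a minimal sketch of the idea: fit a simple model to the original data and sample new records from it. The independent-normal model and the column meanings are assumptions for illustration, not a recommended production approach. Notice that this model preserves each column’s mean and variance but discards correlations between columns, which is exactly the kind of feature that may or may not matter for a given use.

    import numpy as np

    rng = np.random.default_rng(0)

    # Pretend "real" data: two numeric columns, say age and systolic blood pressure.
    real = rng.normal(loc=[50, 120], scale=[12, 15], size=(1000, 2))

    # Fit the model: per-column mean and standard deviation only.
    mu, sigma = real.mean(axis=0), real.std(axis=0)

    # Generate synthetic records with the same marginal statistics.
    synthetic = rng.normal(loc=mu, scale=sigma, size=(1000, 2))
    print(synthetic[:3])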

8.3. What is k-anonymity?

The idea of k-anonymity is that every database record appears in the database at least k times. That sounds simple, but which records should you focus on? In a database table with hundreds of columns, every single record may be unique. In practice people mean that each record appears k times when you focus on some set of potentially identifying attributes. Which attributes should those be? What value of k is advisable in a particular context?
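
Here is a minimal sketch of computing k for a chosen set of quasi-identifiers; the column names and records are made up for illustration.

    from collections import Counter

    records = [
        {"zipcode": "77339", "age": 40, "sex": "F", "diagnosis": "flu"},
        {"zipcode": "77339", "age": 40, "sex": "F", "diagnosis": "asthma"},
        {"zipcode": "77345", "age": 35, "sex": "M", "diagnosis": "flu"},
    ]

    def k_anonymity(records, quasi_identifiers):
        # k is the size of the smallest group of records sharing the same
        # combination of quasi-identifier values.
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return min(groups.values())

    print(k_anonymity(records, ["zipcode", "age", "sex"]))  # 1: not even 2-anonymous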

8.4. What is ℓ-diversity?

ℓ-diversity is a variation on k-anonymity intended to address some of the weaknesses of the latter.

Suppose you apply k-anonymity on quasi-identifiers to a data set, and a set of k people who share the same quasi-identifiers also share the same value of an additional sensitive attribute. The data set does not leak identity, but it does leak information. ℓ-diversity addresses this issue.
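
Continuing the hypothetical records above, here is a minimal sketch: ℓ is the smallest number of distinct sensitive values within any group of records that share the same quasi-identifiers.

    from collections import defaultdict

    records = [
        {"zipcode": "77339", "age": 40, "sex": "F", "diagnosis": "flu"},
        {"zipcode": "77339", "age": 40, "sex": "F", "diagnosis": "flu"},
        {"zipcode": "77345", "age": 35, "sex": "M", "diagnosis": "asthma"},
    ]

    def l_diversity(records, quasi_identifiers, sensitive):
        groups = defaultdict(set)
        for r in records:
            groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
        return min(len(values) for values in groups.values())

    # The first group is 2-anonymous, but both members share the same diagnosis,
    # so l = 1: identity is protected, the diagnosis is not.
    print(l_diversity(records, ["zipcode", "age", "sex"], "diagnosis"))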

8.5. What is t-closeness?

See this article.

8.6. What is earth-mover distance?

See this article.

8.7. What is differential privacy?

Differential privacy is a relatively new approach to data privacy, one that seeks to quantify and minimize the effect that any one person has on inferences drawn from a particular data set.

See an introduction to differential privacy here.

8.8. What are the advantages of differential privacy?

One of the strengths of differential privacy is that it treats all data as equally sensitive. This is also one of its weaknesses. It is a conservative approach to data privacy, but in some cases it can be too conservative.

Differential privacy adds randomness to query results, calibrating the amount of randomness to correspond to the sensitivity of the query. This is done automatically. No one has to determine a priori how sensitive all possible queries are; the software implementing differential privacy determines this on the fly using the actual data. No one has to anticipate what might be identifiable and possibly guess incorrectly.
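
Here is a minimal sketch of this calibration for a counting query. One person can change a count by at most 1 (sensitivity 1), so Laplace noise with scale 1/ε suffices for ε-differential privacy; the data and the value of ε are made up for illustration.

    import numpy as np

    rng = np.random.default_rng()

    def dp_count(values, predicate, epsilon):
        # Counting queries have sensitivity 1, so use Laplace noise of scale 1/epsilon.
        true_count = sum(1 for v in values if predicate(v))
        return true_count + rng.laplace(loc=0, scale=1 / epsilon)

    ages = [34, 41, 29, 57, 62, 45, 38]
    print(dp_count(ages, lambda a: a > 40, epsilon=0.5))  # noisy count of people over 40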

Here is an example of differential privacy in action, making it possible to answer questions about dates that would not be allowed under HIPAA Safe Harbor.

8.9. How can competitors share data without giving up their data?

See this article.

8.10. Does differential privacy scale up?

Absolutely. The US Census Bureau applied differential privacy to the 2020 US census. More on that here.

8.11. Does differential privacy scale down?

This is more of a challenge. Differential privacy does not work well with very small databases: the amount of randomness needed to protect individuals may be so large that query results are no longer useful.

How small is too small? That depends on the nature of the data. We could help you answer that question.

8.12. What is a privacy budget?

A privacy budget is the way differential privacy keeps track of the cumulative risk to privacy from running multiple queries. More on privacy budgets here.
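
Here is a minimal sketch of budget accounting under basic sequential composition, where the ε values of the queries answered so far simply add up; real implementations use more sophisticated composition theorems.

    class PrivacyBudget:
        def __init__(self, total_epsilon):
            self.total = total_epsilon
            self.spent = 0.0

        def charge(self, epsilon):
            # Refuse any query that would push cumulative spending past the budget.
            if self.spent + epsilon > self.total:
                raise RuntimeError("privacy budget exhausted")
            self.spent += epsilon

    budget = PrivacyBudget(total_epsilon=1.0)
    budget.charge(0.4)      # first query
    budget.charge(0.4)      # second query
    try:
        budget.charge(0.4)  # third query would exceed the budget
    except RuntimeError as e:
        print(e)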

9. Privacy risks

9.1. What are toxic pairs?

Toxic pairs are unusual combinations of attributes that could yield clues to someone’s identity. Maybe each item alone is unremarkable, but the combination is remarkable. More on toxic pairs here.

9.2. Can you really identify most people from their zipcode, date of birth, and sex?

Yes. Latanya Sweeney demonstrated this in 1997 by providing William Weld, then governor of Massachusetts, with his own medical records, which she identified in a supposedly deidentified data set using his zipcode, date of birth, and sex.

More on how easily people can be identified based on this information here.

9.3. Does an expired credit card tell you anything about the former cardholder?

The credit card number shows what bank you were using, which may well be your current bank. It may also reveal the first few digits of your new credit card number. More on that here.

9.4. Is your personal data safe with a company that promises not to sell it?

Given how frequently companies inadvertently give your data to hackers, it may not matter whether they keep their promise to never sell the data.

Aside from being hacked, companies have ways of getting around promises not to sell your data. For example, they may barter your data, trading it to another company for some kind of compensation other than cash. The company that got the data in the trade may then sell it unless doing so was prohibited.

9.5. Is there any privacy risk in revealing the last four digits of your SSN?

Yes.

9.6. How can missing data be an identification risk?

In some cases the missing data may be filled in by logical inference. For example, a missing value for sex might be inferred from a diagnosis that occurs in only one sex.

9.7. What can go wrong if an ID number is computed from personal data?

US states used to compute your driver’s license number from other personal information. This made it possible either to compute someone’s license number from their personal data or to use their license number to infer personal data. More on this here.

States no longer do this, but other government agencies or private companies might; presumably some do. Computing IDs from personal data sounds like a good idea: a generation ago a lot of people thought so and saw no problem with it, and surely not everyone has learned from the mistakes of the past.

9.8. Can you identify someone from medical images?

This is a difficult question because it depends on context. Medical textbooks are filled with images that presumably do not compromise anyone’s identity. But if a small set of images is known to belong to a small set of people, it might be possible for someone to match some images to some people.

In general it can be hard to identify people from medical data. See, for example, this article on attempting to identify people from electrocardiogram data.

On the other hand, something like data from a fitness tracker could give clues to a person’s identity.

9.9. What can go wrong with posting photos?

Photos can contain a large amount of metadata, such as EXIF (Exchangeable Image File Format) metadata. In addition, you’d be surprised what location clues are in a seemingly innocuous photo. There are people who have a hobby of identifying locations from photos from clues in the background.
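
As a sketch of how much metadata can ride along with a photo, here is how one might dump EXIF tags, including embedded GPS coordinates, using the Pillow library. The file name is hypothetical, and the exact API details may vary with the Pillow version.

    from PIL import Image, ExifTags

    img = Image.open("vacation_photo.jpg")   # hypothetical file
    exif = img.getexif()

    # Print the top-level EXIF tags (camera model, timestamp, software, ...).
    for tag_id, value in exif.items():
        print(ExifTags.TAGS.get(tag_id, tag_id), value)

    # GPS coordinates live in a nested block; 34853 is the standard GPSInfo tag.
    for tag_id, value in exif.get_ifd(34853).items():
        print(ExifTags.GPSTAGS.get(tag_id, tag_id), value)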

9.10. How can trying to protect your privacy backfire?

This is known as the Streisand Effect.

10. Online privacy

10.1. Does the browser option “Do not track” help?

Not really. Companies routinely ignore “do not track” (DNT) requests. In fact, since most users do not select DNT, selecting it can make you easier to identify. This is the reason Apple gave for removing the feature from its browsers.

A German court recently ruled against LinkedIn for ignoring DNT. Maybe this will result in more companies honoring it. Or maybe the LinkedIn case is the exception that proves the rule: it is so common to get away with ignoring DNT that it’s newsworthy when a company is held accountable.

10.2. What is browser fingerprinting?

The information about a computer revealed by a web browser – the fonts installed, the operating system version, the browser version – is often unique. Fonts alone are usually enough to uniquely identify a browser. See this article for more on font fingerprinting.
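
Here is a minimal sketch of the idea: a handful of attributes, none identifying on its own, are combined and hashed into a stable, near-unique identifier. The attribute values are made up for illustration.

    import hashlib
    import json

    attributes = {
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "screen": "1920x1080x24",
        "timezone": "America/Chicago",
        "fonts": ["Arial", "Calibri", "Comic Sans MS", "Garamond"],
    }

    # The fingerprint is stable as long as the attributes don't change.
    fingerprint = hashlib.sha256(json.dumps(attributes, sort_keys=True).encode()).hexdigest()
    print(fingerprint)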

10.3. Does Tor protect your identity?

It didn’t protect the identity of Ross William Ulbricht, who used the pseudonym Dread Pirate Roberts on the Silk Road network.