Big data may not know your name, but it knows everything else (wired.com)
190 points by nemoniac on Dec 30, 2021 | hide | past | favorite | 69 comments



Two things I'd like to say here

1. All anonymisation algorithms (k-anonymity, l-diversity, t-closeness, e-differential privacy, (e,d)-differential privacy, etc.) have, as you can see, at least one parameter that states to what degree the data has been anonymised. This parameter should not be kept secret, as it tells entities that are part of a dataset how well, and in what way, their privacy is being preserved. Take something like k-anonymity: the k tells you that every equivalence class in the dataset has a size >= k, i.e. for every entity in the dataset, there are at least k-1 other entities in the dataset with identical quasi-identifiers. There are a lot of things wrong with k-anonymity, but at least it's transparent. Tech companies however just state in their Privacy Policies that "[they] care a lot about your privacy and will therefore anonymise your data", without specifying how they do that.
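
To make the k concrete, here is a minimal Python sketch that measures it for a toy table. The column names, the records and the helper `k_of` are all made up for illustration, and the table is assumed to be non-empty; which columns count as quasi-identifiers is a judgment call the data owner has to make.

```python
from collections import Counter

def k_of(records, quasi_identifiers):
    """Return the size of the smallest equivalence class, i.e. the k
    for which this (non-empty) table is k-anonymous."""
    classes = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(classes.values())

# Hypothetical, already-generalised records (age ranges, truncated zip codes).
records = [
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "021**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "diabetes"},
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
]

k = k_of(records, ["age", "zip"])
print(k)  # 2: the smallest group (age 40-49) has two members
```

Publishing that single number next to the dataset is the kind of transparency meant above.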

2. Sharing anonymised data with other organisations (this is called Privacy Preserving Data Publishing, or PPDP) is virtually always a bad idea if you care about privacy, because there is something called the privacy-utility tradeoff: you either have data with sufficient utility, or you have data with sufficient privacy preservation, but you can't have both. You either publish/share useless data, or you publish/share data that does not preserve privacy well. You can decide for yourself whether companies care more about privacy or utility.

Luckily, there's an alternative to PPDP: Privacy Preserving Data Mining (PPDM). With PPDM, data analysts can submit statistical queries (queries that only return aggregate information) to the owner of the original, non-anonymised dataset. The owner will run the queries, and return the result to the data analyst. Obviously, one can still infer the full dataset as long as they submit a sufficient number of specific queries (this is called a Reconstruction Attack). That's why a privacy mechanism is introduced, e.g. epsilon-differential privacy. With e-differential privacy, you essentially guarantee that no query result depends significantly on one specific entity. This makes reconstruction attacks impossible.
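
As a rough illustration of the PPDM idea, here is a sketch of a counting query answered through the Laplace mechanism, one common way to get e-differential privacy. The function names, the data and the epsilon value are assumptions for the example, and a real deployment would also have to track the total privacy budget spent across queries and use a noise sampler hardened against floating-point attacks.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(rows, predicate, epsilon, rng):
    """Answer a counting query with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # seeded only to make the sketch repeatable
ages = [23, 35, 41, 47, 52, 58, 61, 67]  # made-up data held by the owner
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(f"true count is 6, noisy answer is {noisy:.2f}")
```

The analyst only ever sees the noisy answer. Smaller epsilon means more noise and more privacy, which is the privacy-utility tradeoff from point 2 showing up as a single dial.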

The problem with PPDM is that you can't sell your high-utility "anonymised" datasets, which sucks if you're a big boi data broker.


Can advertisers be legally forced to use these mathematical techniques?

Can we perhaps have a trusted third party which anonymizes data for these companies?


I don't think they could be legally forced to use specific algorithms - most laws state the ends ("privacy"), not the means.

In the old world of analog marketing, you had market research companies (Nielsen, Kantar, GfK) to measure the audience and provide benchmarks.

One way to help curb the power of adtech companies would be to force them to let go of measurement. That would require adjustments to privacy laws, creating a specific "data processor for audiences" role.


The way it usually works, legislature writes the law that states the end, and establishes (or repurposes) an executive agency to implement the means, vesting them with the power necessary to do so. The agency then comes up with specific procedures etc - and they can enforce that.

For example, the federal law in US does not define the procedure to properly destroy a firearm (such that it ceases to be regulated by the relevant laws) - but ATF does, and it's fairly specific: https://www.atf.gov/firearms/how-properly-destroy-firearms

I don't see why the same approach couldn't work here.


Could you not set legislation so that, to claim anonymity, companies have to provide a certain level which is mathematically bounded? And/or, as the OP suggested, add transparency? I'd think the combination is better, since transparency alone won't make sense to most people and won't allow them to make informed decisions.


In some ways it's preferable to leave the legislation a little open ended and put the details into a more flexible rule-making process. Then the rules can be updated by knowledgeable people as circumstances change, either to adopt new standards, relax them, or address unforeseen gaps.


Important concepts. Key thing that has changed in privacy in last couple years is that de-identified data has recently been made into a legal concept instead of a technical one, whereby you do a re-identification risk assessment (not a very mature methodology in place yet), figure out who is accountable in the event of a breach, label the data as de-identified, and include the obligation of the recipients to protect it in the data sharing agreement.

The effect on data sharing has been notable because nobody wants to hold risk, where previously "de-identification" schemes (and even encryption) made their risk and obligation evaporate as it magically transformed the data from sensitive to less sensitive using encryption or data masking. Privacy Preserving Data Publishing is sympathetic magic from a technical perspective, as it just obfuscates the data ownership/custodianship and accountability.

FHE is the only candidate technology I am aware of that meets this need, and DBAs, whose jobs are to manage these issues, are notoriously insufficiently skilled to produce even a synthesized test data set from a data model, let alone implement privacy preserving query schemes like differential privacy. What I learned from working on the issue with institutions was nobody really cared about the data subjects, they cared about avoiding accountability, which seems natural, but only if you remove altruism and social responsibility altogether. You can't rely on managers to respect privacy as an abstract value or principle.

Whether you have a technical or policy control is really at the crux of security vs. privacy, where as technologists we mostly have a cryptographic/information theoretic understanding of data and identification, but the privacy side is really about responsibilities around collection, use, disclosure, and retention. Privacy really is a legal concept, and you can kick the can down the road with security tools, but the reason someone wants to pay you for your privacy tool is that you are telling them you are taking on breach risk on their behalf by providing a tool. The people using privacy tools aren't using them because they preserve privacy, they use them because it's a magic feather that absolves them of responsibility. It's a different understanding of tools.

However, it does imply a market opportunity for a crappy snakeoil freemium privacy product that says it implements the aforementioned techniques but barely does anything at all, and just allows organizations to say they are using it. Their problem isn't cryptographic, it just has to be sophisticated enough that non-technical managers can't be held accountable for reasoning about it, and they're using a tool so they are compliant. I wonder what the "whitebox cryptography" people are doing these days...


It's kind of funny. Sure, with intense math you can maybe back into who some people are from an anonymous audience.

Meanwhile the government just asks the ISP what you've been doing and they happily comply.


There is a very big difference here though, at least ostensibly (doesn't matter much to you if the government wants to know where you were two weeks ago at noon).

The government has to prove based on the reasons that we choose in a democracy that we all want /why/ it has an interest in knowing such a thing.

The companies, on the other hand, literally get to know whatever they want and it's up to us to prove why they shouldn't actually know that thing.

Now, if we want to have a debate about which is more abused in practice, or which is more dangerous, I'm all about it. But the difference in access to information based on proving a need, versus proving a harm, is actually quite stark in theory.


I have a question

Are zero-knowledge proofs really zero-knowledge if you do enough of them, can’t you reconstruct?


Do you have a link where these concepts are explained in more detail?


Anecdotally, the only time I've seen a truly anonymized database was in a European genetics research company, due mainly to the rightly high amount of regulation required in the medical field.

There was a whole separate legal entity, with its own board, that did the phenotype measurement gathering and stored the data in a big database on premise. The link between those measurements and the individual's personal identifiable record was then stored in a separate airgapped database which had cryptographic locks implemented (on the data and physical access to the server) so accessing the data took the physical presence of the privacy officers of each of the two companies (the measurement lab and the research lab) and finally what I found at the time to be the unique move; a representative from the state run privacy watchdog.

To be able to backtrack the data to a person, there was always going to be a need to go through the watchdog. Required, not just legally mandated.

All of the measurement data that was stored in the database came from very restricted input fields in the custom software that was made on prem (no long form text input fields for instance, where identifying data could be put in accidentally), and there was a lot of thought put into the design of the UI to limit the possibility that anyone could put identifiable data into the record.

For instance, numerical ranges for a specific phenotype were all prefilled in a dropdown, so as to keep user key input to a minimum. Much of the data also came from direct connections to the medical equipment (I wrote a serial connector for a Humphrey medical eye scanner that parsed the results straight into the software, skipping the human element altogether).

This didn't make for the nicest looking software (endless dropdowns or scales), but it fulfilled its goal of functionality and privacy concerns perfectly.

The measurement data would then go through many automatic filters and further anonymizing processes before being delivered through a dedicated network pipeline (configured through the local isp to be unidirectional) to the research lab.

Is this guaranteed to never leak any private information? No, nothing is 100%. This comes damn near close to it, but of course would not work in most other normal business situations.


> To be able to backtrack the data to a person, there was always going to be a need to go through the watchdog.

The assumption in the deanonymization literature is that this data is unavailable. So no, you don't need to go through any watchdog.


Yes they had to, in case the person giving the data had opted for being notified about some severe medical condition or other revelations that would show up during the analysis process. For those cases, these mappings were kept around, and did require going through the watchdog.


I think you misunderstood my point. There are ways of deanonymizing data, just by looking at the data alone. In fact, this is the standard assumption.

This watchdog stuff is nice for the "good" actors, but irrelevant for adversaries.


I did understand the point perfectly. The mechanism was there simply for the good actors to backtrace the data to the matching person; its purpose was never to play a part in making the data more anonymous.

If what you meant to say was the more clear statement that adversaries wouldn't need to do that then I'd have agreed with you.

Everything else that was mentioned; the strict processes in determining what data could be stored, what it exposed about the user if anything, eliminating as much of the human input as possible and the post processing of the data before it left the measurement lab. These are the steps that achieved anonymity (as far as everyone believed had been achieved).


In Google we have a bunch of researchers on anonymity and that whole thing is _hard_. I vaguely remember supporting, a couple years ago, a pipeline where logs stripped out of "all PII" came at one end, aggregated data out of the middle... Into an anonymity verifier, that then redirected much of it into /dev/null, because a technique is known to de-anonymise it. And the research on differential anonymity advanced quite a bit in the meantime.


Did that stuff work?


-Anecdotally, at a former employer we had an annual questionnaire used to estimate how content we were.

The results, we were assured, would only be used in aggregate after having been anonymized.

I laughed quite hard when the results were back - the 'Engineer, female, 40-49yrs, $SITE' in the office next door wasn't as amused. All her responses had been printed in the report. Sample size: 1.


At our (fairly large) company, you can query by team (and maybe job role) but it will hide responses where the sample size is smaller than a set number (I think 8 or 10).

So yes it can be done but people have to actually care about it.
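
A minimal sketch of that kind of threshold, in Python. The threshold of 8, the field names and the survey data are assumptions for illustration, not how any particular survey tool works.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 8  # assumed threshold; the comment above guesses 8 or 10

def aggregate_scores(responses, group_keys, min_group_size=MIN_GROUP_SIZE):
    """Average the scores per group, suppressing groups that are too small."""
    groups = defaultdict(list)
    for response in responses:
        key = tuple(response[k] for k in group_keys)
        groups[key].append(response["score"])
    report = {}
    for key, scores in groups.items():
        if len(scores) >= min_group_size:
            report[key] = sum(scores) / len(scores)
        else:
            report[key] = None  # suppressed: too few respondents
    return report

# Hypothetical survey: nine engineers at site A, one engineer at site B.
responses = [{"role": "engineer", "site": "A", "score": 4} for _ in range(9)]
responses.append({"role": "engineer", "site": "B", "score": 1})

report = aggregate_scores(responses, ["role", "site"])
print(report[("engineer", "A")])  # 4.0
print(report[("engineer", "B")])  # None, the sample size of 1 is hidden
```

This only handles the "sample size: 1" failure. It does nothing against someone comparing two overlapping reports.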

The cautionary tale about k-anonymity (from Aaron Schwartz's book I think) is when the behavior of aggregates is also something that should be kept private - the example was that the morning run of an army base in a foreign country was revealed because enough people did this with their smartwatches on that it formed a neat cluster.


Isn’t location data particularly easy to de-anonymize? I remember reading some research that because people tend to be so consistent with their location, you could deanonymize most people in a dataset with 3 random location samples through the day


This is likely the paper you are referring to:

https://www.nature.com/articles/srep01376

It states that 95% of people can be identified from just 4 location samples.


More details on the Strava Run incident: https://www.bbc.com/news/technology-42853072


In Germany (I think all of the EU), a dataset can only be published if the sample size is at least 7.


Just fwiw sample size isn’t a robust defence against this kind of attack. Check out Differential Privacy.


Never answer those honestly. More than half the time they aren't there to "help management understand how to do better", rather to purge people who aren't happy. "We don't need employees who don't love it here."


I always say I am unhappy and would take another offer in a heartbeat, and so far that strategy has worked well for me - I usually get offered a good salary bump and bonus every year. Obviously YMMV.


-Oh, we were already deemed beyond redemption - we'd been a small company, quite successful in our narrow niche, only to be bought up by $MEGACORP.

That was a culture clash. Big time.

Nevertheless, I thought the same thing you did and filled in my questionnaire so that my answers created a nice, symmetrical pattern - it looked almost like a pattern for an arts&crafts project...


Isn't this a trivial example though? You require bucket size >= 10 say.

Full on differential privacy is meant to address harder problems. For example, if you assume that there was a minor bug and people were asked to take the mandatory yearly survey again in a week.

Now if you assume that the responses of people who responded before will stay the same when they re-respond, then by taking the difference between both results you can identify some specific people by looking at which bucket counts changed - this would be how the new guy who joined in that week voted.

Differential privacy techniques are meant to anonymise data in a way that you can't even do this.
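
The re-survey scenario above can be sketched in a few lines of Python. The employee names and answers are invented; the point is only that both published tables pass a "bucket size >= 10" check and the difference still singles out one person.

```python
from collections import Counter

def bucket_counts(answers):
    """The published aggregate: how many people picked each answer."""
    return Counter(answers.values())

def differencing_attack(before, after):
    """Subtract two published aggregates and keep the buckets that changed."""
    diff = Counter(after)
    diff.subtract(before)
    return {bucket: n for bucket, n in diff.items() if n != 0}

# Hypothetical first run: 12 happy and 12 unhappy employees.
week1 = {f"emp{i}": ("happy" if i % 2 == 0 else "unhappy") for i in range(24)}
# Re-run a week later: same answers, plus the new hire.
week2 = dict(week1, new_hire="unhappy")

leak = differencing_attack(bucket_counts(week1), bucket_counts(week2))
print(leak)  # {'unhappy': 1} -> that is the new hire's answer
```

With noise added to each published count, the difference between the two tables would no longer pin down the one changed answer with certainty.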


It's trivial to fix in retrospect, nearly impossible to anticipate all similar corner cases ahead of time, and -- most importantly -- indicative of the type of logical bugs our systems of the world have that permit hacking from script-kiddies to state actors.


It may be nearly impossible to anticipate _all_ corner cases, but come on, GP was talking about there only being a sample size of 1; this is super easy to notice and then just not include in the results.


Can confirm.

I used to have a team and at some point they all had to submit their feedback on my performance. The answers were then fed back to me unattributed, but it was pretty obvious who wrote what.


And - were there "consequences" for these people?


No consequences. There was nothing I didn't know before.

EDIT: I first wanted to say that in any way there would be no consequences, but "there will be no consequences no matter what people write about me" is probably too strong of a statement for anyone to make.


Were there positive comments you didn't expect?


Kinda. I expected some people to leave good feedback, but theirs was way beyond my expectations.


Even without sample size having these aggregated results makes it very easy to predict who picked what with a modicum of extra information. (Even silly binned personality type.)


We've merged https://news.ycombinator.com/item?id=29734713 into this thread since the previous submission wasn't to the original source. That's why some comment timestamps are older.


If you're interested in this topic, I recommend the chapter on Inference Control in Ross Anderson's excellent book "Security Engineering". It's one of the ones which is freely available on his web site https://www.cl.cam.ac.uk/~rja14/Papers/SEv3-ch11-7sep.pdf


One click removed from an original source that is soaked in second-rate adtech crap.

To my dying day I will regret being one of the architects of this insidious mechanism.

The problem with trying to evade, or defeat, or even sidestep this stuff is that latent-rep embeddings break human intuition in their effectiveness.

There was a time when the uniqueness of one’s signature could move money.

Those pen twitches are still there to see in what order you click on links.


Having a conscience is good, but don't be too hard on yourself old friend.


Just wait until they have your unconscious eye-twitches.

Foveated rendering is required for adequate resolution/refresh of VR/AR due to the bandwidths and GPU calcs involved (e.g. 6k x 6k x RGB x 2 eyes x 120Hz = ~200Gb/s). Updating only the 20-30deg around the eye's focus allows reducing this by >10x, which reduces power/weight and GPU cost dramatically.
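
The back-of-the-envelope number checks out if you assume 8 bits per colour channel (24 bits per pixel), which the comment does not state. A quick Python check, with the 10x factor taken as the lower bound quoted above:

```python
# Assumptions: 6000 x 6000 pixels per eye, 8 bits per RGB channel.
width = height = 6000
bits_per_pixel = 3 * 8
eyes = 2
refresh_hz = 120

full_rate_gbps = width * height * bits_per_pixel * eyes * refresh_hz / 1e9
foveated_rate_gbps = full_rate_gbps / 10  # ">10x" reduction, so at most this

print(f"full frame: {full_rate_gbps:.1f} Gb/s")         # 207.4 Gb/s
print(f"foveated:   <= {foveated_rate_gbps:.1f} Gb/s")  # 20.7 Gb/s
```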


Remember that 33 bits of entropy are enough to identify everyone. It may not be legally so, but any data with 33 bits of entropy is technically PII, and you should treat it as such.


That makes no sense, sorry.

Ok, 2^33 > world population, but that doesn't mean that the string "Hello world" is PII.


33 bits of entropy, not just 33 bits


That depends on the encoding, does it not? The binary sequence equal to ASCII "Hello world" might well be PII with many different encodings. By accident, of course, but nevertheless 33 bits of information would be enough.


Unless someone is actually called Hello World. Or perhaps Bobby Tables. ;)


> any data with 33 bits of entropy is technically PII,

Well yes, but actually no.

I can run UUIDgen 33 times and put it on pastebin, via tor. That does not mean I have 128 peoples' worth of PII. In fact I have ~0 bits of PII. If I were to paste one of those numbers in this comment, now they are all linked to kortex, and however much entropy that username has (call it 32, this account is leaky). But you are still short 127 "PIIs". The rest of the entropy means nothing.

Conversely, a US phone number is <24 bits, but 100% PII.

It's not 33 bits of "entropy". It's 33 bits worth of a distribution which can be correlated to an identity. If it's not correlated, it's not PII.


What do you mean here? I am asking because this is potentially useful.


There are ~8,000,000,000 people in the world; that's a ten-digit number so that's the smallest size of number which could count out a unique number for everyone in the world, 9 digits doesn't have enough possible values. If the digit values are based on details about you, e.g. being in USA sets the second digit to 0/1/2, being in Canada and male sets it to 3, being in Canada and female sets it to 4, the last two digits are your height in inches, etc. etc. then you don't have to count out the numbers and give one to everyone, the ten digits become a kind of identifier of their own. 1,445,234,170 narrows down to (a woman in Canada 70 inches tall ... ) until it only matches one person. There are lots of people of the same height so perhaps it won't quite identify a single person, but it will be close. Maybe one or two more digits is enough to tiebreak and reduce it to one person.

Almost anything will do as a tie-break between two people - married, wears glasses, keeps snakes, once visited whitehouse.gov, walked past a Bluetooth advertising beacon on Main Street San Francisco. Starting from 8 billion people and making some yes/no tiebreaks that split people into two groups, a divide and conquer approach, split the group in two, split in two again, cheerful/miserable, speaks Spanish yes/no, once owned a dog yes/no, once had a Google account yes/no, once took a photo at a wedding yes/no, ever had a tooth filling yes/no, moved house more than once in a year yes/no, ever broke a bone yes/no, has a Steam account yes/no, anything which divides people you will "eventually" winnow down from 8 billion to 1 person and have a set of tiebreaks with enough information in them to uniquely identify individual people.

I say "eventually" because if you can find tiebreaks that split the groups perfectly in half each time, then you only need 33 of them to reduce 8 billion down to 1. This is all another way of saying counting in binary: 101001011010100101101010010110101 is a 33-bit binary number, it can be seen as 33 yes/no tiebreaks, and it's long enough to count up past 8 billion. That's 2^33: two possible values in each position, 33 times.
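
For anyone who wants to check the arithmetic, a small Python sketch. The population figure is the rounded 8 billion used above, and `halvings_to_one` is just an illustrative helper that counts perfect tiebreaks.

```python
import math

WORLD_POPULATION = 8_000_000_000  # rounded figure used in this thread

def halvings_to_one(group_size):
    """Count how many perfect yes/no splits shrink a group to one person."""
    steps = 0
    while group_size > 1:
        group_size = (group_size + 1) // 2  # keep the larger half
        steps += 1
    return steps

bits_needed = math.ceil(math.log2(WORLD_POPULATION))
print(bits_needed)                        # 33
print(halvings_to_one(WORLD_POPULATION))  # 33
print(2**32 < WORLD_POPULATION < 2**33)   # True: 32 bits is not enough
```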

That means any collection of data about people which gets to 33 bits of information about each person is getting close to being enough data to have a risk of uniquely identifying people. If you end up gathering how quickly someone clicks a cookie banner, that has some information hiding in it about how familiar they are with cookie banners and how physically able they are, that starts to divide people into groups. If you gather data about their web browser, that tells you what OS they run, what version, how up to date it is, those divide people into buckets. What time they opened your email with a marketing advert in it gives a clue to their timezone and work hours. Not very precise, but it only needs 33 bits before it approaches enough to start identifying individual people. Gather megabytes of this stuff about each person, and identities fall out - the person who searched X is the same person who stood for longer by the advert beacon and supports X political candidate and lives in this area and probably has an income in X range and ... can only be Joe Bloggs.


Related, and fun to think about:

https://www.gwern.net/Death-Note-Anonymity

Solving an anonymous murderer with a supernatural MO, but an ego that betrays him.


This isn’t as meaningful as you think because each bit has to have no correlation with the others, which is hard. Or to put it another way, each bit has to perfectly bisect the population, otherwise you’ll have collisions and a bunch of empty space.


Each bit has to perfectly bisect the population for 33 bits to be sufficient. But that's the minimum and it's meaningful because of how tiny an amount of data it is compared to how much we process all the time with computers.

The time someone takes to get rid of a cookie popup is not a clear signal, it doesn't bisect the population but it has signal in it. Faster suggests fast computer, good mouse or trackpad, youth, health, familiarity with the internet. Slower suggests low quality dirty mouse or trackpad, old age, poor health, unfamiliarity with the internet. There is some data there, towards some sub-groups and away from others. "Doesn't keep snakes" is low signal because that's most people. "Keeps snakes" is high signal because it narrows down to a small sub-group.
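
Both sides of this argument can be put into numbers with a short Python sketch. The 1% snake-keeping rate is invented for illustration. `surprisal_bits` is how much one specific answer tells you, and `binary_entropy` is how much the yes/no question tells you on average.

```python
import math

def surprisal_bits(fraction_of_population):
    """Bits learned when you observe an answer shared by this fraction."""
    return -math.log2(fraction_of_population)

def binary_entropy(p_yes):
    """Average bits gained per person from a yes/no question (0 < p_yes < 1)."""
    p_no = 1.0 - p_yes
    return p_yes * surprisal_bits(p_yes) + p_no * surprisal_bits(p_no)

P_KEEPS_SNAKES = 0.01  # hypothetical frequency, not a real statistic

print(f"perfect split, any answer: {surprisal_bits(0.5):.2f} bits")
print(f"'keeps snakes':            {surprisal_bits(P_KEEPS_SNAKES):.2f} bits")
print(f"'doesn't keep snakes':     {surprisal_bits(1 - P_KEEPS_SNAKES):.2f} bits")
print(f"snake question on average: {binary_entropy(P_KEEPS_SNAKES):.2f} bits")
```

A rare "yes" is worth several bits about that one person, while the same question averaged over everyone is worth far less than the 1 bit a perfect bisection gives.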

At extreme best/worst case _33_ of them can single someone out, but think that we leak thousands and thousands of bits every day, day after day, to all kinds of information processors who sell and aggregate them, and some that we leak are very specific like GPS coordinates or nearby WiFi SSIDs or app login details, that's why I think it's meaningful.

Facebook patented an idea of identifying photos taken with the same camera by looking at dust, distortion and damage on the camera lens. Imagine you go into a shop, open their app and photograph a QR code for a discount and that picture ties the app user to every public online photo they ever took, and the app does some device fingerprinting too. Then at home you open the web browser and visit a site with some tracking JS, and that ties a probable connection between your home internet IP with the person who was in the store earlier. You open an app and your home GPS coordinates and nearby WiFi signals are tied to the same home internet IP, and public name and address records identify who was in the store earlier. Facebook buys that advertising data and makes a shadow profile of someone who lives here, shops there, took all these photos and probably knows some of the people who appear in the photos. That's a lot more than 33 bits, but it's not a lot to imagine happening for people casually using apps and the internet several times a day every day over the last decade. And no matter how hard you try to guard against it, the wrong 33 bits leaked is all it would take in the worst case to undo all your efforts.


Sure, but again, the 33 bits have to be a set of perfect dimensions. You’re severely downplaying how difficult it is to get them. It’s realistically not any different than saying, “if everyone had a unique ID, it would only take 33 bits to store that.” It’s not very meaningful.

In fact, it would be a major fucking deal if you could publish 33 dimensions that each bisect the population exactly.

“Keeps snakes” is a perfect example of something that isn’t good enough. It’s too infrequently true to be useful in the 33-bit uniqueness identifier.


Jawdropping. So this is the 33 degrees I've heard people throw around. Thank you so much for elaborating in such a detailed and insightful way.


Another way to look at this is that 2^33 is about 8.6 billion:

    >>> pow(2, 33)
    8589934592

There are fewer than 8.6 billion people in the world, so you could just create a map from each person to a number in the set.


Thank you for gearing this explanation for a person versed in software but not data analytics. It helped a lot.


One of the best posts I've seen on HN


Two tangential "yes and" points:

1)

I'm not smart enough to understand differential privacy.

So my noob mental model is: Fuzz the data to create hash collisions. Differential privacy's heuristics guide the effort. Like how much source data and how much fuzz you need to get X% certainty of "privacy". Meaning the likelihood someone could reverse the hash to recover the source identity.

BUT: This is entirely moot if original (now fuzzed) data set can be correlated with another data set.

2)

All PII should be encrypted at rest, at the field level.

I really wish Wayner's Translucent Databases was more well known. TLDR: Wayner shows clever ways of using salt+hash to protect identity. Just like how properly protected password files should be salt+hash protected.

Again, entirely moot if protected data is correlated with another data set.

http://wayner.org/node/46

https://www.amazon.com/Translucent-Databases-Peter-Wayner/dp...
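
A rough Python sketch of the salt+hash idea from point 2, using a keyed hash (HMAC) from the standard library. This is not Wayner's exact scheme, and the field names, the example address and the key handling are assumptions for illustration.

```python
import hashlib
import hmac
import secrets

def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest.

    The same identifier and key always give the same token, so rows can
    still be joined, but the token cannot be reversed without the key.
    """
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Assumption: the key is generated once and stored apart from the database.
key = secrets.token_bytes(32)

row = {"patient": pseudonymize("joe.bloggs@example.com", key), "visits": 3}
print(len(row["patient"]))  # 64 hex characters; the address itself is not stored
```

As said above, this protects the identity column only. The remaining columns can still be correlated with another data set.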

Bonus point 3)

The privacy "fix" is to extend property rights to all personal data.

My data is me. I own it. If someone's using my data, for any reason, I want my cut.

Pay me.


Without digging into details, I just want to agree that reversing into a client's PII, while not trivial, is certainly possible, and depending on the class of data, one can achieve fidelity levels in the high 80 percentiles.


To me it is crazy selling that data is even legal.


To me it’s crazy that people go to work doing this. Even if you can negotiate the moral issues, how do people find things like adtech interesting?

Some people seem legitimately excited when saying their product helps customers show “relevant experiences” and so on.


Working in adtech was interesting for me because I learned how to work with distributed systems, and it taught me the importance of optimizing, scaling, instrumentation etc. It's extremely satisfying shaving a ms or two off response times in an app when you have 15 ms in total to respond.

I built for example a data pipeline that ingests Bluekai/Oracle data cloud data and from a big data perspective it's an amazing experience: from dealing with ingesting billions of records per day to making sure you can build audiences based on those records in near real time.

From a privacy perspective it's very scary stuff. When you look at what data Bluekai has, there are hundreds of millions of profiles scattered around hundreds of thousands of segments. Basically you have segments from frozen food shoppers to people that bought a 2020 Mercedes A-Class.

P.S. I can also say the same thing about Google, FB etc. A lot of people that work there are basically working in adtech, especially in Google.


So basically you are saying that even if the domain is something between uninteresting and downright "bad", the fact that it has a lot of technically interesting aspects makes up for it? I guess I can understand that. That's most likely the reality for a lot of us. Not that the domain we work in is inherently questionable, but that it's just mundane (insurance, manufacturing, whatever).

I feel the same way about most defense tech, for example. I'd have nothing against working on missile guidance systems that actually kill people, but I'd never take a job in adtech. So what we count as questionable is just individual.


"It is difficult to get a man to understand something when his salary depends upon his not understanding it."

Upton Sinclair


If I am not mistaken, yesterday there was a top comment that started "Adtech veteran here." Perhaps surveilling people on the internet distorts one's sense of time, or, alternatively, some folks will just say anything to defend what they are doing.


Because human behavior is absolutely fascinating, as is the fact it can be shaped.


Enjoy:

FinanceIQ by AnalyticsIQ - Consumer Finance Data USA - 241M Individuals https://datarade.ai/data-products/financeiq

Individual Consumer Data https://datarade.ai/search?utf8=%E2%9C%93&category=individua...


Seeing this made me feel deeply uneasy….



