How wide was the Equifax data breach?

143 million US consumers were caught up in the data breach. I keep seeing it portrayed as 44% of the US population. But the US population includes children.

Initially, it seemed to me the better metric was IRS tax filers: the breach affected 11 million more people than filed taxes in 2016. The problems with this comparison? Lots of people who file taxes might not have a credit history, and some with credit histories might not file taxes in a given year. Which raises another issue: comparing filers for a single tax year against people who built credit histories across many years is sketchy.

Other statistics give me headaches too.

  • The US Census’ latest 2016 estimate is that there were 325M (million) people in the country. The original 44% statistic is based on that.
  • The US Census’ latest 2016 estimate is that there were 249M adults in the country. That brings the percentage up to 57%.
  • The Bureau of Labor Statistics says that in July 2017, when the hack occurred, there were 160M people in the civilian labor force. That excludes inmates and members of the armed forces, most of whom probably have credit histories.

So, I took the BLS 160M and looked up the excluded populations.

  • It looks like there were about 1.5M people in prison.
  • And there are about 1.4M active-duty military.

Combining these, it looks like about 88% of people in the “potentially have worked population” were affected.
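
For transparency, here is the arithmetic as a quick Python sketch (my own back-of-envelope; the figures are the rounded estimates cited above):

    # Share of the "potentially have worked" population affected.
    # Figures are the rounded estimates cited above, in millions.
    affected = 143.0      # consumers caught up in the Equifax breach
    labor_force = 160.0   # BLS figure for July 2017
    prisoners = 1.5       # approximate prison population
    military = 1.4        # approximate active-duty military

    potentially_worked = labor_force + prisoners + military  # 162.9M
    print(f"{affected / potentially_worked:.1%}")            # 87.8%, call it 88%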

I feel good with the 88% number.

Really, though, everyone probably has had their SSN and birthday exposed. If you have ever attended a K-12 school or post-secondary institution, gotten insurance, gone to a doctor, engaged in any way with a financial institution, or given your SSN to a government entity, then you should assume your personal information is ready to be exposed at any time. Nor should you rely on being told. The state of Georgia accidentally exposed every voter’s SSN to subscribers of its voting list and notified no one because officials felt that getting the CDs returned meant no one could have the info. (As if the subscribers could not have copied the files off the discs.)

Death of Childhood Icons in 2016

This post will date me, but people about my age have whined a lot this year about icons of our childhoods dying.

A year is an arbitrary range of time for a solar revolution. It could run from March 21st (the spring equinox) to March 20th just as easily as it currently runs from January 1st to December 31st. Or any other range of dates.

Some people have blamed the year. (Hopefully not seriously.) I think it is simply because we are older. Our parents are of an age where they are more likely to have health problems and possibly die, the same as their parents were for them about 20-30 years ago.

The people who were famous for things they did in the ’70s and ’80s are of an age where they are more likely to die from complications of being old.

Others were younger, but still getting close to old. Some of these are also known for their drug use.

The BBC made an interesting point…

There are also more famous people than there used to be. In my father or grandfather’s generation, the only famous people really were from cinema — there was no television. Then, if anybody wasn’t on TV, they weren’t famous.

Well, there were radio stars prior to TV. And these days TV, movie, and theatre actors and actresses significantly participate in multiple media.

There is also the 27 Club: celebrities who died at age 27. Think Jimi Hendrix or Janis Joplin or Kurt Cobain. There was one this year: Anton Yelchin, who played Chekov in the Star Trek reboot.

Finally, the availability heuristic is also at play. News organizations talking so much about these deaths can make it FEEL like more died this year than last. Just because the number of stories is up does not mean more died.

Better Predictions

13.7: Cosmos and Culture has a good article on predictions.

Making good predictions isn’t just about your accuracy; it’s also about your calibration.

Accuracy = how often you are correct
Calibration = how well your stated confidence matches how often you are correct

All too often when we see predictions, no one asks about the calibration. Nor do we go back and check the accuracy. I know I am weird in the sense that when I make predictions for work, such as a file system running out of space in x days, I often attach my level of confidence to that prediction. (It might happen sooner, but it could also happen much later.)
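
Checking calibration later is easy if the confidence gets recorded alongside each prediction. Here is a minimal sketch of the bookkeeping I mean (the prediction data is made up purely for illustration):

    from collections import defaultdict

    # Each record: (stated confidence, whether the prediction came true).
    # Made-up data for illustration only.
    predictions = [
        (0.9, True), (0.9, True), (0.9, False), (0.9, True),
        (0.7, True), (0.7, False), (0.7, False),
        (0.5, True), (0.5, False),
    ]

    # Group outcomes by stated confidence.
    buckets = defaultdict(list)
    for confidence, correct in predictions:
        buckets[confidence].append(correct)

    # Well calibrated: the hit rate in each bucket matches the stated confidence.
    for confidence in sorted(buckets):
        outcomes = buckets[confidence]
        hit_rate = sum(outcomes) / len(outcomes)
        print(f"said {confidence:.0%}, right {hit_rate:.0%} of {len(outcomes)}")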

The greater our expertise, the harder our overconfidence tends to bite us. Knowing a lot hurts us. When our confidence is too high, we minimize the space of allowable error in our thinking. Even a worrier like myself falls victim often enough. Here is an example overconfidence test.

The paper, by psychologists Joyce Ehrlinger, Ainsley Mitchum and Carol Dweck, reports three studies in which participants were asked to estimate their own performance on a task, either a multiple choice test with antonym problems or a multiple choice general knowledge quiz. The participants were asked to estimate their percentile relative to other students completing the task, from zero percent (worse than all other students) to 100 percent (better than all other students). If participants were perfectly calibrated, then the average percentile estimate should have been 50 percent. But that’s not what they found. In Study 1, for instance, the average was 66 percent. Like the children of Lake Wobegon, participants (on average) believed themselves to be better than average.

Thinking back on many of my work predictions? They were probably way too high. Something that was essentially a super wild ass guess (technical term: SWAG) may have been reported as 70-80% confident. It was informed by metrics and a trend, but there was no reason to think that trend would continue.

Solution to Black-On-Black Crime

(This post on black-on-black crime is satire. Thought I would point that out before someone gets too upset over it.)

A Georgia state lawmaker expressed concern about the amount of black-on-black crime. He is obviously referencing the FBI homicide statistics for perpetrator race. Fortunately, the actual number is lower than the 98% he claimed. But he is not wrong that a large majority of homicides of black people are by another black person: 90% in 2013. Here is the thing. The same table shows 83% of white homicides were by another white person. It seems likely both communities have a common problem. They both spend too much time around people of their own race.

If your true goal is to lower black-on-black crime rates or white-on-white crime rates, then the best solution is to mix the races so thoroughly that people are willing to murder across racial boundaries. People tend to kill the kinds of people with whom they spend their time. If you are rarely around people of another race, then your opportunities are fewer. Mixing the races such that people are just people will do wonders toward ending the black-on-black crime problem.

His remarks, though, were that all this concern about eviscerating the Confederate flag or memorials was meant to hide this black-on-black crime problem and other issues with the black community. He also describes the KKK as vigilantes necessary for law and order. So I in no way expect him to agree that better racial integration is an appropriate solution.

Review: Dataclysm: Who We Are

Dataclysm: Who We Are by Christian Rudder
My rating: 3 of 5 stars

Maybe really 2.5 stars, but I rounded up.

I have read the OkTrends blog since its inception. Human behavior fascinates me, so I take any opportunity to read about it. The We Experiment On Human Beings post ensnared my attention since it thumbs its nose at academic sensibilities about what constitutes ethical experimentation. But this review is not about Rudder’s ethics, so I will move on to the book.

The writing engaged a technologist interested in Big Data, interesting links, and how data can be used in interesting ways. (Hardly surprising.) Many references made me laugh out loud. I highlighted 32 places according to my Kindle stats; many more were worthy. The writing alone would make me give it 5 stars.

My first problem was the lack of detail in the main text. Where I expected to read about how conclusions were reached, the details were light. Where it all fell apart for me was the Coda section, where he delved further into the methods used. Suddenly the assumptions, based on nothing but super wild ass guesses (SWAGs), came into complete view. For example, his conservative estimate is that active OkCupid users go on at least one date every two months; he combines this with active users per month to arrive at 30,000 dates happening tonight because of OkCupid. This number is then used for other calculations. I would give this aspect no stars.
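
To see how much hangs on that SWAG, here is my own reconstruction of the arithmetic (not from the book; the active user count is my assumption, chosen so the math lands on 30,000):

    # My reconstruction, not Rudder's actual calculation.
    # ASSUMPTION: ~1.8M active users, picked so the arithmetic yields ~30,000.
    active_users = 1_800_000
    dates_per_user_per_day = 1 / 60   # "at least one date every two months"

    print(round(active_users * dates_per_user_per_day))  # 30000 dates tonight

    # Halve the guess (one date every four months) and the headline halves too.
    print(round(active_users * (1 / 120)))               # 15000

The headline number moves in lockstep with an unverified guess, which is exactly why it bothered me.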

So an average of 2.5 stars, rounded up, yields the 3 of this review.


Shortcuts: Math

(This post is part of a series. Intro > 1. Illusions > 2. Labeling > 3. Math > 4. Multitasking)

Behavioral economics fascinates me. Humans have an amazing ability to miscalculate risk with extreme confidence that they assessed it accurately. The culprits appear to be rules of thumb that work in certain situations but really do not apply to others, yet most people apply them anyway.

Part of the problem gauging risk, I think, comes from a lack of consequences in low-risk situations. Switching from writing a script to answering an email and back while sitting at my desk carries extremely low physical risk. Switching back and forth between driving and answering a text message can seem like no big deal when even being 23x more likely to have an accident still leaves the odds at one in thousands, as the sketch below shows. A lack of accidents or close calls while driving is taken as evidence of the ability to text and drive without a problem. (After all, how risky is it to operate a car of several thousand pounds?)
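
To put numbers on that intuition (my own illustration; only the 23x multiplier comes from the texting-while-driving research, and the baseline per-trip odds are an assumption):

    # Why a 23x multiplier can still FEEL safe on any given trip.
    # ASSUMPTION: baseline odds of a crash on a single trip, illustration only.
    baseline_crash_odds = 1 / 100_000
    texting_multiplier = 23

    texting_crash_odds = baseline_crash_odds * texting_multiplier
    print(f"1 in {round(1 / texting_crash_odds):,}")  # 1 in 4,348

Each uneventful trip at one-in-thousands odds then reads as evidence of skill rather than of a risk 23x higher than it needed to be.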

Following the causal chain of events presents us with problems. We sometimes pick the wrong causes. We then are more likely to pick that wrong cause over and over. Logic and science are tools invented to combat these problems. Testing an idea with large samples eliminates variation as a confound. Others testing with the same or slightly different experimental designs delineate the relevant scope.

“Garbage in; garbage out” can also trip us. We poorly assess the reliability of inputs because of the illusions I discussed earlier, so calculations based on garbage were never going to be good anyway.

Strangely enough, slowing the process down and thinking about it from many different angles can even exacerbate the problem as we get mired in so much data or process that we cannot make a decision.

Technology helps us do the same calculating, just faster. Some of it helps us validate the outputs. I look forward to technologies that help us identify the correct inputs. My big beef with predictive analytics is my doubt that the correct inputs are being identified, so the outputs might contain lots of garbage.

(This post is part of a series. Intro > 1. Illusions > 2. Labeling > 3. Math > 4. Multitasking)

Accounting Predictions

In my Prediction Accountability post, I ranted about how no one really knows whether predictions are accurate, and I ended with the thought that it does not really matter because no one is going to stop using these services just because they are usually wrong. Basically, I thought it futile to even try. In retrospect, that is probably the perfect reason to do it.

So I came up with a scoring system:

    • Good recommendation = 3 points
    • Not interested = -1 point
    • Wishlist/Queue = -2 points
    • Dislike = -3 points

Would you score these differently? Why?

My reasoning goes something like this. Something I agree I should watch should be worth the inverse of the points of something I know from previous experience I will dislike. Anything I am not really interested in definitely is not a win, so it should be negative, but not too close to a dislike. Suggesting something already on that company’s records as interesting to me wastes my time because they already know I want it, so that loses two points.

First pass, Amazon sent me an email today saying,

Are you looking for something in our <x> department? If so, you might be interested in these items.

One item I have thought I should watch based on TV ads but have not yet put on my wishlist, so I agree with Amazon that I might be interested in it. It gets three points. (3) Five items were already in my wishlist, so that is negative two points each. (3 - 10 = -7) One item is the 6th season of a television series where I have only seen part of the first season and have not gotten around to completing even that, so not interested: negative one point. (-7 - 1 = -8) Another item is the 3rd season of a TV series where I have not watched even the first yet. If the recommendation had been the first season, then I would count it as a good one, so instead I award halfway between good and not interested. (-8 + 1 = -7) Out of eight items in the email, the score is -7. That is just one email. I will track this for a couple months and see where it goes. And do the same for Netflix.
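
The bookkeeping is simple enough to script, which should keep me honest over a couple months of tracking. A minimal sketch that reproduces the email above:

    # Points per recommendation category, per the scoring system above.
    SCORES = {
        "good": 3,           # agree I might be interested
        "not_interested": -1,
        "wishlist": -2,      # already on my wishlist/queue
        "dislike": -3,
        "wrong_season": 1,   # halfway between good (3) and not interested (-1)
    }

    # The eight items in today's Amazon email.
    email = ["good"] + ["wishlist"] * 5 + ["not_interested"] + ["wrong_season"]

    print(sum(SCORES[item] for item in email))  # -7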

I think this exercise points out the possibility that these “predictions” are less predictions and more nudges to buy something.

If your Learning Management System vendor claimed a 90%-plus correct prediction rate for whether students will fail a class, then how would you assess it? The obvious start would be to track the predictions for classes without providing them to the instructors, then compare the predictions to actual results. Of course, these things are designed by looking at past results. What is the statement investment companies have to include so they do not get sued for fraud? Oh, right: “Past success does not guarantee future performance.” So I would not rely too much on historical data alone. I would want a real-world test that the system is working accurately.
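
As a sketch of what that real-world test could look like (my own illustration, not any vendor’s method): withhold the predictions from instructors for a term so they cannot influence outcomes, then compare predictions against what actually happened.

    # Hypothetical records: (vendor predicted a failure, student actually failed).
    records = [
        (True, True), (True, False), (False, False),
        (False, False), (True, True), (False, True),
    ]

    correct = sum(predicted == actual for predicted, actual in records)
    print(f"accuracy: {correct / len(records):.0%}")  # compare to the claimed 90%+

    # Accuracy alone can mislead: if only 10% of students fail, predicting
    # "pass" for everyone scores 90% accurate and is completely useless.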

Prediction Accountability

The technology buzzword standard for prediction appears to be Netflix and Amazon. Everyone wants to get to where they make recommendations customers will buy. But are these predictions any good?

Out of the slew of emails you get from Amazon, what percentage of the recommendations do you actually buy? How many do you sneer at and hit delete in disgust that they could get it that wrong? For me, the latter is more common than the former. Certainly it is not from a lack of data: I buy more off that site than from all brick-and-mortar stores, excepting groceries, combined. (And that makes me re-think how I buy groceries.) Maybe Amazon has too much confusing data mixed in with the correct data. I look at things I have no interest in buying, such as when someone mentions having problems with a product. Though I have to question Amazon recommending I buy the camera I had bought from them a couple months prior.

Netflix really is not any better. Their top 10 recommendations change weekly for me. In my current top 10, one was already rated 5 stars. Another four were already in my queue. For the remaining five, Netflix predicted I would rate them between about 3.0 and 3.3 stars. That is out of five. There are 27 items in my queue with higher predictions than these.

Before I start tracking these predictions to gauge their effectiveness, do I even really care? Am I going to stop consuming from companies that overstate their claims? Or should I close my ears when clueless people spout the prediction buzzword? Not really. No. And I guess that is what I am left doing.

I think the standard comes not from the predictions being any good. Instead, decision makers are aware of Netflix and Amazon, so their wanting to emulate them is understandable.

Dashboard vs Feed

John Pavlus, in Ghost’s Blogging Dashboard Doesn’t Need to Exist, fell hook, line, and sinker for Anil Dash’s All Dashboards Should Be Feeds false dichotomy. The better argument is that dashboards only tell of the past, with all its noise, whereas the more useful information is an accurate picture of the future. People ultimately want to know what is going to happen. The feeds would do that.

However, to accomplish that, feeds must take the same data, apply criteria, and report a prediction of value to the user. That’s fantastic stuff. You know… Fantasy.

Someone has to decide how to produce the signal out of all the noise. Probably that is a quant, or a wannabe, who teases the important predictions out of the data. So unless you are beholden to someone like Anil, you want to be able to work the data by looking at something like a dashboard in order to build the feeds.

Not everyone is like me; I get that. Simple users want a magic number or an easy indicator of what is going on. Think of an alert that a site is going to break in 15 minutes. Power users like me want to know which components of those web sites are going to break 15 minutes from now. You know, so I can go fix them. But I would not mind being able to let others subscribe to my feeds where appropriate.

I’ve never had a problem taking dashboard data and projecting trends from it. A good one, like Yaketystats, will even graph the prediction lines for me. I often work with the data to see how this line changes in order to get a sense of whether the prediction has biases built into it. But then, I enjoy being hands-on and manipulating the graphs to see what I want to know. Predictions are only as good as the algorithm. And why should we trust others’ when we can build our own? I could see YS offering alert feeds for directors and above, letting them know about upcoming milestones. It would be great for them, but that high-level view is not so interesting to me. I want the details and to build the things that produce the signal from the noise.
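
The projection itself is nothing exotic. A minimal sketch of the kind of trend line I mean, using numpy (my own illustration, not how Yaketystats does it):

    import numpy as np

    # Daily disk usage in GB over the last week (made-up sample data).
    days = np.arange(7)
    used_gb = np.array([500, 512, 525, 536, 549, 561, 574])
    capacity_gb = 1000

    # Fit a line and project when usage hits capacity.
    slope, intercept = np.polyfit(days, used_gb, 1)
    days_until_full = (capacity_gb - intercept) / slope
    print(f"~{slope:.1f} GB/day; full in ~{days_until_full:.0f} days")

    # Refit as new data arrives; if the slope keeps shifting, the
    # prediction deserves a wide confidence interval.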

Undercounting Stats

Michael Feldstein posted on Twitter:

Seeing signs that Google Analytics significantly undercounts. Any recommendation for easy, reliable db-based WordPress analytics?

I knew Google Analytics relies on JavaScript to measure what users are doing. Bots typically do not execute JS, so they go uncounted. That is OK, probably even great, depending on how much they annoy me. It also occurred to me that browsers now have incognito modes, and a desirable feature while in such a mode would be to not execute known JS stats trackers.

A response to Michael was:

Maybe try Jetpack? Has analytics built in.

I looked at the HTML for my own site. Jetpack appears to be JavaScript based as well.

Looking at Jetpack’s stats, though, I noticed a significant spike in traffic on September 27th. It got 487 hits compared to around 200 each day two weeks prior and since. Details for that day said my Nationalism post had 267 hits compared to my normal leader, Quotes to Make You Think. This made me curious, so I looked up the same day in Google Analytics. No spike in GA. So I pulled the raw access logs. The hits exist, but almost all were from a single IP, and GA recorded no visits to the page at all. Impressively disconcerting. I expected Google Analytics to show 1 visitor for the DSL user with 200+ hits, maybe 1 for the IP with no reverse DNS, and 0 for the Facebook bot.
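
Pulling the logs apart took only a few lines. A minimal sketch of the kind of check I did (assuming the common Apache/nginx combined log format; the file name is made up):

    from collections import Counter

    # Count hits per client IP; the IP is the first whitespace-delimited field.
    hits = Counter()
    with open("access.log") as log:   # hypothetical file name
        for line in log:
            hits[line.split(" ", 1)[0]] += 1

    # One IP dominating the count is a bot or a stuck client, not a traffic spike.
    for ip, count in hits.most_common(5):
        print(f"{count:6d}  {ip}")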

Anyway, I looked at various WordPress plugins. I think WP Slimstat is the db-based WP analytics I will check out. It looks mature and seems pretty consistent with what I see in the raw hits. Too bad I did not add it a long time ago so I could compare Slimstat to GA and Jetpack over the same period. I will have to let it collect data and do this again.

Good thing I enjoy this stuff.