IT Data Profiling That is Not Happening Before the Invasion of the “Data Snatchers” That Sell our Profiles & Informa
Posted Aug 24 2011 1:12am
When you hear the word privacy today you also hear the word profile, but the purpose of this article is not to discuss that kind of profile. It is about what companies do, or do not do, before selling information and aggregating it from one area to another. We see this a lot in medical records: say a review of systems is entered in the chart, and then every other doctor seeing the patient uses the same document or file. What if it’s wrong? You may have a dozen or so people working with wrong information, and then how do you get to the bottom of it? This is a big issue today.
Unfortunately, when data is not profiled and checked for errors, there are folks out there who think that because it’s in the database it has to be correct…wrong. I have proven that on this blog with the web-based MD referral sites, and that’s huge, as the sites are not updated as they should be. That is why I found my deceased former doctor still listed in the internet afterlife and still available to see patients. There was no profiling done to check for accuracy there.
Nobody at Health Grades or any of the other companies that do the same was checking, and thus we had no company “data profiling”. Most just want to use the data, sell it or whatever they do with it, and run to first base. In some cases information is just scraped from the web, and there’s a fat chance that any “business data profiling” is going on to clean things up. Social networks do not always get it right, and yet people believe that what’s on there is 100% accurate, which comes back to digital literacy.
The example in this story is great; it drives the point home and will give you something to think about when it comes to data entry. There was a required field, and because the data entry clerks were paid by how many forms they could fill in, when the information was not there they made it up…hmmmm…is this what is called “pay for performance”? Sounds like it to me, and it shows how the carrot does not always lead to accuracy. In the example, data profiling engines could be run to check why so many people had the same birthday; in other words, the engine looks for certain patterns, just like the software that checks medical claims does. We have tons and tons of software there, all right, but when it comes to doing some data profiling before shipping our data out for sale, nope, this does not happen, as they can make a buck without it. Later, when things don’t jibe, the consumers get fleeced and have to prove their innocence, as again everyone thinks that what is on that screen is 100% accurate when it comes to business.
This brings me back around to the FICO fleece, where they are selling information based on your credit scores, combined with a lot of other information they obtain freely from the web, and stating that it can predict whether you as a patient will take your medications. This is flat-out marketing of data, and you can bet it is shoved together to make that sale to a pharmaceutical company or insurer, as that is their market, with little or no profiling of the data, just queries for “desired” results. They don’t care if the data is accurate or not; they want a sale, and we as consumers are fleeced.
I made a post a while back proposing that we require licenses and tax this money, since we can’t really stop it, and it is still a good idea. You can see the damage that the overuse is creating with our economy, which is based on algorithms only, without enough companies actually producing a tangible product in the US.
Now with data systems being combined and re-analyzed, we are getting to the point with machine learning where the computers don’t know the difference and we end up with no way of tracking back to see where the original information came from, which is another good reason to license and tax so we have a point of origin. We should learn this from Wall Street so as not to repeat what they did; the same goes for algorithms that react with each other with no human intervention. In short, there’s nobody minding the data store. It’s like a used car being sold from one party to another in a very short period of time with nobody even kicking a tire, because analyzing and doing the company data profiling costs money, and that eats into the profits they make selling the data.
We are now writing the unreadable, and this will eventually be a huge mess with all types of corrupt and tainted data if a registry is not started soon so there’s an audit trail. Otherwise, corrupted and flawed data that nobody will ever be able to trace will be held against you at some point in time. Watch the video below and see how the algorithms of 2 different computer-run vacuum cleaners see your carpet, and you’ll get the picture, as all algorithms are not created equally. We have a runaway train here with nobody minding the store, and data will grow to be more unreliable and polluted, just like the water; but also like the water, we can take steps to avoid this and clean it up. BD
.. we were doing data profiling on some fields like date of birth. We were noticing some weird trends – people tended to be born on Jan. 1, Feb. 2, March 3, April 4, May 5 – and we were like, “Okay, what’s going on? Why are those particular dates coming up?” We found out that the date of birth field was a required field on an insurance application that most people – like if they were applying for automobile insurance – didn’t feel the need to provide.
But the data entry clerks are being paid based on how many applications they can get in per hour, so when they came across a field that was required that didn’t have a value for, they just made a date up and they just basically picked their birth year, but used Jan. 1 or Feb. 2. So you had a bunch of bogus dates entered into the system. The data value was accurate. Jan. 1, 1970, is legitimately somebody’s birthday, but it’s not the birthday of the customer that was associated with the insurance transaction.
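To make the pattern in that quote concrete, here is a minimal Python sketch of the kind of check a profiling pass might run on a date-of-birth column. It is only an illustration: the function names, the sample dates, and the 3x threshold are my own assumptions, not anything taken from the tool described in the quote. The idea is that only 12 days in a year have the month number equal to the day number (Jan. 1, Feb. 2, and so on), so a column where that happens far more often than about 3% of the time deserves a second look.

```python
from datetime import date

# About 12 of 365 calendar days have month == day (Jan. 1, Feb. 2, ... Dec. 12).
EXPECTED_SHARE = 12 / 365


def month_equals_day_share(birth_dates):
    """Return the fraction of dates whose month number equals the day number."""
    dates = list(birth_dates)
    if not dates:
        return 0.0
    hits = sum(1 for d in dates if d.month == d.day)
    return hits / len(dates)


def looks_fabricated(birth_dates, tolerance=3.0):
    """Flag the column when the observed share is far above the chance share.

    The tolerance multiplier is an arbitrary value chosen for illustration.
    """
    return month_equals_day_share(birth_dates) > tolerance * EXPECTED_SHARE


# Hypothetical sample: three of the five dates follow the Jan. 1 / Feb. 2 pattern.
sample = [
    date(1970, 1, 1),
    date(1970, 2, 2),
    date(1985, 7, 19),
    date(1970, 3, 3),
    date(1962, 11, 30),
]

print(month_equals_day_share(sample))
print(looks_fabricated(sample))
```

Note that every date in the sample would pass a normal field validation, which is exactly the point of the quote: each value is a valid date, and only the pattern across many records gives the fabrication away.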
When developers write a new application for the input of some new data, it's normal for input fields to be 'validated' - a simple 'hard coded' form of profiling. ... Yet people have far fewer reservations about integrating data from here, there and everywhere - often not checking for even the most egregious data errors, and thereby polluting the organizational drinking water. Data profiling engines are a great technology for quickly improving the quality of data as it is integrated from one system into another.
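As a rough idea of what profiling a field during integration can look like, here is a small Python sketch that summarizes one column before it is loaded into another system. This is an assumption-laden illustration, not how any particular profiling engine works; the function name `profile_column`, the summary keys, and the sample values are all hypothetical.

```python
from collections import Counter


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def profile_column(values, top_n=3):
    """Summarize a column: row count, missing count, distinct count, top values."""
    values = list(values)
    present = [v for v in values if not is_missing(v)]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "top_values": counts.most_common(top_n),
    }


# Hypothetical date-of-birth column arriving from another system.
incoming = ["1970-01-01", "1970-01-01", "", None, "1985-07-19", "1970-01-01"]

summary = profile_column(incoming)
print(summary)
```

A summary like this does not fix anything by itself, but a value that dominates the top of the list, or a large share of missing entries, tells you where to look before the data gets mixed into the organizational drinking water.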
In the long run, data profiling can be used both tactically and strategically. Tactically, it can serve as an integral part of data improvement programs. Strategically, it can help managers determine the appropriateness of different data source systems under consideration for deployment in a particular project.