The "before there is data" data, otherwise known as protodata

The description of data and information is a perspective-based process. Anybody who has read a book on information theory could tell you that information comes from data, that data is made up of bits, and that bits are the smallest functional unit of information. We usually apply this from the perspective of an outsider looking in; it works and has stood as a principle for a long time (since 1954?). If data begets information, we can also have data that describes other data. As an example, 15 gigabytes is a measurement of bits that make up bytes, and we count how many of those there are. We are not describing the meaning of those bits and bytes, just giving a metric of description.
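
To make that last point concrete, here is a trivial sketch (the numbers are illustrative): the same byte count describes a film, a database dump, or random noise equally well, because it says nothing about what the bits mean.

```python
# A count of bytes describes data without describing its meaning.
gigabytes = 15
bytes_total = gigabytes * 10**9   # decimal gigabytes, as storage vendors count them
bits_total = bytes_total * 8      # 8 bits per byte

print(f"{gigabytes} GB = {bytes_total:,} bytes = {bits_total:,} bits")
# 15 GB = 15,000,000,000 bytes = 120,000,000,000 bits
```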

In the computer forensics world we deal a lot with metadata. Tools like Splunk and Timestomp either mine metadata or erase it. We have a tendency to talk about metadata as if it isn't important, but it is likely the MOST important part of an investigation. It leaks context and content like crazy. As an example, if I know two entities are communicating and one is a known felon, I might reasonably assert through association that the second entity is also a felon. This, of course, is fallacious. Even felons have brothers, sisters, mothers and fathers. Yet if the bad guy does something exceptionally heinous, like saying nasty things about the FBI director, the FBI might swoop down and grab all known associates.

In legal investigations the information in a phone call might be obscured by contextual or technological means that make it meaningless. Pulling context out of a communication is an extremely difficult process. You would think reading what I'm saying is simple and that listening to a phone call would be just as simple. That is less than fully true. Context, content, and the like are harder to gather when the evidence of malfeasance is ambiguous. What is not hard to do is aggregate metadata into a significant product.
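
As a rough illustration of that last point (the records below are invented, not any real dataset), a few lines of aggregation over nothing but call metadata already produce a who-contacts-whom picture, with no content in sight:

```python
from collections import Counter

# Hypothetical call-detail records: (caller, callee, duration in seconds).
call_records = [
    ("alice", "bob", 120),
    ("alice", "bob", 45),
    ("carol", "bob", 300),
    ("alice", "dave", 15),
]

# Aggregating only the metadata surfaces the association structure.
pair_counts = Counter((caller, callee) for caller, callee, _ in call_records)
for (caller, callee), count in pair_counts.most_common():
    print(f"{caller} -> {callee}: {count} call(s)")
```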

On Twitter I mentioned a term that I ran across a while back at a conference on big data in the intelligence field. The term was protodata. I'm not talking about the SourceForge project by the same name or the library itself. The concept is fairly simple if a bit circular. I have never seen this concept in scholarly writing, but it is really self-evident once you grasp it. We usually think of data from the standpoint of information components that we describe and then record. We take interesting chunks and gather that information. Metadata is the data surrounding that gathered information. How do you describe the data you are going to describe before you have described it? How do you talk about things to gather without knowing you are going to gather them? This idea was discussed in the context of, "It would have been great to measure this element, but we didn't think about it until after we were done." This is one of the many stumbling blocks on which intelligence programs like Total Information Awareness fall down.

Taking the concept a bit further: protodata is data before there is data. In the conference discussion I alluded to earlier, we talked about how commerce is quite interested in that idea. How do you predict what you will need to gather when you don't know whether a product will be successful? When the product is beans and potatoes, you have a great concept up front of the logistics and supply chain data. When the product is digital media and content services, the data and supply chain information are significantly more dynamic.

Inherently we deal with protodata by prediction, and the closer to execution, the better the prediction becomes. Some will say this is a silly concept and laugh it off, or worse, mock the entire idea. They did much the same when we talked about data warehousing in the mid 1990s and later data mining in the late 1990s and early 2000s. Now we have big data and a growing suite of tools. This isn't a concept we'll see gain any kind of traction right now, but if you're leaning forward a bit you can start to get a glimmer of what it means. Jokes about Minority Report and predictive analytics aside, the intelligence and corporate communities would be very interested in chasing this idea. Fundamentally we're talking about breaking the barrier of "past performance does not predict future results."

I don't want people to think I came up with this as some kind of fog-from-the-sea concept. Way back in the day, long before things like Timestomp existed, my wife wrote a metadata editor. That is her story to tell, but when she asked her boss what kind of metadata editor she should write, he said, "a good one." The idea of editing metadata is an interesting one. If you are going to edit metadata, you need to know what metadata you are going to edit. That is a discovery process. What if you don't have the data currently in your possession but know you will want to edit it in the future? Now we're talking about preparing a tool for protodata.
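
To make the discover-then-edit distinction concrete, here is a minimal sketch using only ordinary file-system timestamps (the file name is hypothetical, and Timestomp itself works on NTFS timestamps, which is a richer problem than this):

```python
import os
import time

path = "report.docx"  # hypothetical file assumed to exist

# Discovery: what metadata is there to edit?
stat = os.stat(path)
print("accessed:", time.ctime(stat.st_atime))
print("modified:", time.ctime(stat.st_mtime))

# Editing: rewrite the access and modification times to an arbitrary date.
new_time = time.mktime((2001, 1, 1, 12, 0, 0, 0, 0, -1))
os.utime(path, (new_time, new_time))
```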

I admit to being fuzzy on the concept, but after discussing it on Twitter with a few people, this little trip down geek trivia needed a response longer than 140-character chunks. With the recent focus of the news on metadata, even more discussion about it seems important.

The example I like to use for metadata is the content of a tweet. At 140 characters there is not a lot of content or context in the tweet that most people see. Lots of people get really upset over the idea of the government reading their tweets, but the real data is behind a tweet: the date the tweet was created, the screen names, the author's user name, the biography of the author at the time of the tweet, the creation date of the account, the number of favorites the user has, the following and follower counts, the time zone set, the selected language, and much more. All of this is context rich for analysis of a single tweet. All of that is metadata for a tweet. We have to realize each element on its own is actually data and would be information found in a field of a database.
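
For a rough picture of what sits behind those 140 characters (field names modeled loosely on what the Twitter API exposes, values invented), the metadata looks something like this:

```python
# Illustrative sketch of the metadata carried alongside a single tweet.
tweet = {
    "text": "Just 140 characters of content...",
    "created_at": "2013-06-10T14:22:31Z",
    "user": {
        "screen_name": "example_user",
        "name": "Example User",
        "description": "Bio text at the time of the tweet",
        "created_at": "2009-03-02T08:15:00Z",
        "favourites_count": 412,
        "followers_count": 1530,
        "friends_count": 287,
        "time_zone": "Eastern Time (US & Canada)",
        "lang": "en",
    },
}

# Each metadata element is itself data: a value you could drop into a database field.
for key, value in tweet["user"].items():
    print(f"{key}: {value}")
```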

We find that in discussing metadata there are a variety of types (structural, context, typing, etc.) and that throwing the word around is nearly meaningless without providing data about how we're going to use the metadata. Yes, that would be metametadata. I just had to finish with that little bit of silliness. If you're interested in more, look me up on Twitter and let me know.
