In fact, we’ve stored about 1.4 million metadata facts across 890 datasets.
Why so much? There couldn’t possibly be that much metadata in the NYC Open Data catalog.
And you’re right: less than a third of it is conventional metadata. We also sampled each dataset, and against those samples we’ve calculated lots of “extrametadata” – stats, scores, machine inferences, etc.
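To make “extrametadata” a little more concrete, here is a minimal sketch of how facts could be derived from a sample of rows. It is not the actual NYCFacets pipeline: the function names (`infer_type`, `extrametadata`), the fact names (`fill_rate`, `distinct_count`, and so on) and the sample rows are all made up for illustration.

```python
from collections import Counter


def infer_type(values):
    """Guess a column type from sampled values (a deliberately naive heuristic)."""

    def parses_as(cast, value):
        try:
            cast(value)
            return True
        except (ValueError, TypeError):
            return False

    if not values:
        return "empty"
    if all(parses_as(int, v) for v in values):
        return "integer"
    if all(parses_as(float, v) for v in values):
        return "number"
    return "text"


def extrametadata(sample_rows):
    """Compute a few per-column facts from a list of row dicts (hypothetical fact names)."""
    columns = sorted({key for row in sample_rows for key in row})
    facts = {}
    for col in columns:
        # Treat missing keys, None and empty strings as "no value".
        values = [row[col] for row in sample_rows if row.get(col) not in (None, "")]
        facts[col] = {
            "fill_rate": len(values) / len(sample_rows) if sample_rows else 0.0,
            "distinct_count": len(set(values)),
            "inferred_type": infer_type(values),
            "most_common": Counter(values).most_common(1)[0][0] if values else None,
        }
    return facts


# Made-up sample rows, not real NYC Open Data records.
sample = [
    {"borough": "Brooklyn", "hotspots": "12", "lat": "40.68"},
    {"borough": "Queens", "hotspots": "7", "lat": "40.73"},
    {"borough": "Brooklyn", "hotspots": "", "lat": "40.65"},
    {"borough": "Bronx", "hotspots": "3", "lat": ""},
]

for column, column_facts in extrametadata(sample).items():
    print(column, column_facts)
```

Even this toy version produces four facts per column, which hints at how a few hundred datasets can add up to a very large number of stored facts.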
We went to all this trouble because data is the lifeblood of the Information Economy. And to use it effectively, both machines and humans need to understand the data.
Yep, machines AND humans.
There is simply too much data for a user to go through, even with all the search mechanisms Socrata has exposed. Socrata leap-frogs what NYCBigApps had in years past (basically, a web server loaded with data files in different formats), but it’s an embarrassment of riches: too much for even NYCBigApps innovators, much less the general public, to go through.
And even our PCrank calculations confirm this – just look at the number of downloads and views for NYC Open Data. Top of the list is Wifi hotspot locations. Why? Is there such a pressing need to know where those hotspots are? If you look at the dataset, can you quickly find out where a nearby access point is? Is this previously unavailable information so compelling that it is far and away the most viewed and downloaded dataset in NYC Open Data?
Nope. It’s because it was the first item on the list. If you go through the View distribution chart, you can see that views quickly trail off as you go beyond the first page.
So we need to compile the metadata and the extrametadata so that Machines can preprocess the information and present it in a way that facilitates Human understanding.
And add to that feedback mechanisms so that we can create a virtuous cycle that taps the Wisdom of the Community (machines included) to continuously refine the data.
So NYCFacets is much, much more than just an online data dictionary. (Though we’d like to think it’s an awesome one, with all the techniques we’ve leveraged to navigate the data – Faceted Search, Google Instant-like searching, Search autocompletes, Semantic browsing, Visualizations, Inline queries, Drilldowns, Multi-way Data Explorers, etc.)
It’s just the beginning of our work at Pediacities to help lay the foundation for accelerating Smart City data innovation as we pursue NYC’s Digital Roadmap.
Over the next few days, Sami and I will go through NYCFacets and demonstrate how it can help NYC Open Data innovators…