Data Intelligence and Metadata Inference
I recently spent some time exploring one of my favorite data intelligence topics: metadata inference. How do we take an undocumented dataset, like a CSV file, and automatically generate documentation and knowledge around it with little or no human intervention? The end goal is to uplift the data, complement it with machine-actionable metadata, and transform it into digital knowledge.
This is something I have been researching for years, and I have many scribbled notes and ideas about it. But it is time to take my understanding of, and passion for, this topic to the next level by turning the research into concrete tools and APIs.
Synthetic Data for R&D
I expect that this work will lead me to delve into advanced data processing, knowledge management, and machine learning. However, for now, my initial focus is on fundamental data profiling and building an information model. I have been diving into RDF and related standards like SKOS and OWL, as well as identifying various Python packages to support this.
I’m still working on the foundation but, being a hands-on person, I know that I will quickly want actual data to test all this against. While I could, and will, use existing datasets or write my own synthetic generators, I remembered using Mockaroo for similar purposes in the past and thought it would be a good time to revisit it.
Mockaroo is a web-based tool that allows users to quickly and easily produce fake datasets for a variety of purposes, such as testing software, populating databases, or creating sample documents. It offers a wide range of options for customizing the generated data, including the ability to specify data types, patterns, and constraints.
Mockaroo can be used to rapidly create and manipulate large volumes of synthetic data, and it also comes with an API. The free version will meet many users’ requirements, and paid options are available if necessary. It goes a long way and will save you countless hours of developing your own generators.
CSV files for profiling
For my research, I need CSV files that I can parse for basic profiling, such as detecting variable data types, generating descriptive statistics, analyzing value domains, identifying string patterns, and detecting missing values. Mockaroo is well suited to this purpose as it can, as expected, generate data records with various numeric, string, and date/time variables.
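As a rough illustration of what this kind of basic profiling can look like, here is a minimal sketch using only the Python standard library. The function names (`infer_type`, `profile_csv`), the list of missing-value markers, and the sample columns are all assumptions made for this example, not part of any existing tool.

```python
import csv
import io
import statistics
from datetime import datetime

# Assumed markers for missing values; real files may use others.
MISSING = {"", "na", "n/a", "null", "none"}


def infer_type(values):
    """Return the narrowest type label that fits every value in the list."""
    def all_parse(parser):
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if not values:
        return "empty"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(datetime.fromisoformat):  # ISO 8601 dates and timestamps only
        return "datetime"
    return "string"


def profile_csv(text):
    """Profile each column of CSV text: type, missing count, distinct count, basic stats."""
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {}
    for name in (rows[0].keys() if rows else []):
        raw = [(row[name] or "").strip() for row in rows]
        present = [v for v in raw if v.lower() not in MISSING]
        column = {
            "type": infer_type(present),
            "missing": len(raw) - len(present),
            "distinct": len(set(present)),
        }
        if column["type"] in ("integer", "float"):
            numbers = [float(v) for v in present]
            column["min"], column["max"] = min(numbers), max(numbers)
            column["mean"] = statistics.mean(numbers)
        profile[name] = column
    return profile


if __name__ == "__main__":
    # Hypothetical sample, standing in for a generated file.
    sample = "id,age,signup,city\n1,34,2023-01-15,Paris\n2,,2023-02-01,Lyon\n"
    for column_name, stats in profile_csv(sample).items():
        print(column_name, stats)
```

The type check goes from strictest to loosest (integer, float, datetime, string), so a column only falls back to "string" when nothing narrower fits all of its non-missing values.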
My next level of analysis, though, will go beyond traditional profiling and aim at inferring the role and conceptual nature of the data, such as whether a variable is an identifier, a phone number, a geographical location, a point in time, or a person’s characteristics. This is where Mockaroo becomes really handy, as it includes a set of extended data types (169 at this time), including city names, country codes, individual or company names, health diagnosis codes, and many more.
Expert knowledge-inferring agents should then be able to recognize these and associate the variables with the appropriate concepts, domains, or classifications, or deduce the nature of the records.
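To make the idea of recognizing a variable's conceptual nature more concrete, here is a deliberately simple sketch based on regular expressions and a lookup set. Everything in it is an assumption for illustration: the detector names, the patterns, the tiny country code subset, and the 90% threshold. A real inference agent would need far richer rules or trained models.

```python
import re

# Tiny illustrative subset of two-letter country codes, not a complete list.
COUNTRY_CODES = {"US", "FR", "DE", "CA", "JP", "BR"}

# Hypothetical detectors: each maps a concept label to a test on one string value.
DETECTORS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+").fullmatch,
    "us_phone": re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}").fullmatch,
    "ipv4": re.compile(r"(\d{1,3}\.){3}\d{1,3}").fullmatch,
    "country_code": lambda value: value.upper() in COUNTRY_CODES,
}


def infer_concept(values, threshold=0.9):
    """Return (concept, match_ratio) for the best detector, or "unknown" below the threshold."""
    values = [v for v in values if v]  # ignore empty values
    if not values:
        return ("unknown", 0.0)
    best, best_ratio = "unknown", 0.0
    for concept, test in DETECTORS.items():
        ratio = sum(1 for v in values if test(v)) / len(values)
        if ratio > best_ratio:
            best, best_ratio = concept, ratio
    if best_ratio >= threshold:
        return (best, best_ratio)
    return ("unknown", best_ratio)


def looks_like_identifier(values):
    """Heuristic: a column may be an identifier if every value is present and unique."""
    return bool(values) and all(values) and len(set(values)) == len(values)


if __name__ == "__main__":
    print(infer_concept(["ann@example.com", "bob@example.org"]))
    print(infer_concept(["(555) 123-4567", "555-987-6543"]))
    print(looks_like_identifier(["A1", "A2", "A3"]))
```

Scoring each detector by the share of values it matches, rather than requiring every value to match, leaves room for a few dirty records in an otherwise recognizable column.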
I didn’t want all possible data types in a single test file, so I started to create a collection of Mockaroo schemas, with each variable representing one of the available types or its variations. Here are the few I came up with: Numbers, DateTime, Person, Health, Information Technology, Commerce, Locations, US Locations, US Address.
Note how easy these are to share, as Mockaroo generates a public URL for each created schema and offers the option to make it available as an API endpoint. Feel free to browse and use them for your own needs. You can register and get a free key if you want to use the API instead of the web UI.
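For readers who prefer the API route, the sketch below shows one way a saved schema might be downloaded as CSV from Python. The endpoint URL and the `key`, `schema`, and `count` query parameters are assumptions about Mockaroo's generate API and should be checked against its current documentation. The helper names, the schema name, and the `YOUR_API_KEY` placeholder are hypothetical.

```python
import urllib.parse
import urllib.request

# Assumed endpoint for generating CSV from a saved schema; verify against Mockaroo's API docs.
BASE_URL = "https://api.mockaroo.com/api/generate.csv"


def build_url(api_key, schema, count=100):
    """Build the request URL for a saved schema (parameter names are assumptions)."""
    query = urllib.parse.urlencode({"key": api_key, "schema": schema, "count": count})
    return f"{BASE_URL}?{query}"


def download_csv(api_key, schema, count=100, timeout=30):
    """Fetch generated records as CSV text. Requires network access and a valid key."""
    url = build_url(api_key, schema, count)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder values: replace with your own key and the name of a schema you saved.
    print(build_url("YOUR_API_KEY", "US Address", count=10))
```

Keeping URL construction separate from the network call makes the request easy to inspect before spending any of the daily record quota that a free key may be subject to.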
I naturally have quite a bit of work ahead of me before my knowledge-inferring vision becomes reality. I’ll share more about this in future articles, and through GitHub and APIs once things are in decent shape. But Mockaroo is definitely a component of my toolkit.