A Freebase data distribution that's easy to use.
Freebase is an amazing data resource at the core of Google's "Knowledge Graph". Freebase data is available for full download but, as of today, using it "as a whole" is anything but simple.
The SindiceTech Freebase distribution solves that by providing all the Freebase knowledge preloaded in an RDF-specific database (also called a triplestore), equipped with a set of tools that make it much easier to compose queries and understand the data as a whole.
Why have all the data locally?
You basically get your own private Freebase. This means you can run as many queries as you want, as complex as you want; you won't be revealing what you're looking for, and you won't have external dependencies. More than that, you can easily combine Freebase data with your own datasets and query them together in a unified manner. It's a great start to your own "Knowledge Graph".
On Google Cloud
The distribution is packaged as a virtual machine snapshot that you can easily spin up in Google Cloud. Join our Google group and follow the instructions to get started and have it running in minutes.
The distribution comes in the form of a VM on Google Cloud. Once unpacked and started, it contains:
Before you can start consuming data from Freebase, you need to understand what is in the dataset. What types of entities are present? What are the instance counts? How are the instances connected to each other? And so on.
The Data Types Explorer
Seeing how the data fits into categories, and which are the most specific types that refer to it, is key to writing queries that return the right results, with no more noise (and no fewer results) than needed.
Freebase has thousands of types as part of its entire dataset. What’s more, an item in Freebase can be associated with multiple types, resulting in a complex type hierarchy.
Furthermore, Freebase does not itself have a strictly enforced type hierarchy, which means that:
The following figure illustrates the lattice ordered by number of types per entity:
In the first column on the left are the groups of entities which are marked with a single data type. For example, if Freebase contains an entity named "Acme the Picture" simply typed as film.film, then this entity will be counted only in the first box on the bottom left of the picture.
According to the data in this release, there are 91,555 such entities (the size count in the diagram) marked as film.film but not as anything else.
On the other hand, the cumulative size label indicates how many entities have film.film as one of their types, whether or not they have others (in this case 159,769).
In the case of award.award_nominated_work we see that only 2,890 entities out of 56,029 do not have any other type. This is expected, as one would expect these award-nominated works to also be properly typed as what they are (a film, for instance).
Moving to the second column, we can see clusters of entities which have 2 types, for example the cluster of entities which are award.award_nominated_work but also film.film. This is a set of 4,313 entities (with just these 2 types), while the total number of entities that have at least these two types (again, the cumulative size) is 11,486.
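The difference between the two numbers can be reproduced on a small scale. The sketch below is only an illustration on made-up toy data, not the Hadoop job used to build the distribution: the entity names and the function name `cluster_sizes` are hypothetical. It counts, for each set of types, the entities that have exactly those types (the size) and the entities that have at least those types (the cumulative size).

```python
from collections import Counter
from itertools import combinations


def cluster_sizes(entity_types):
    """Return (exact, cumulative) counters keyed by frozenset of types.

    exact[c]      = entities whose types are exactly the set c
    cumulative[c] = entities whose types include every type in c
    """
    exact = Counter(frozenset(types) for types in entity_types.values())
    cumulative = Counter()
    for types, count in exact.items():
        # An entity contributes to every non-empty subset of its own types.
        # This is exponential in the number of types per entity, which is
        # acceptable for a toy example only.
        for k in range(1, len(types) + 1):
            for subset in combinations(sorted(types), k):
                cumulative[frozenset(subset)] += count
    return exact, cumulative


# Made-up toy data: entity name -> set of Freebase-style type names.
entities = {
    "acme_the_picture": {"film.film"},
    "another_film": {"film.film", "award.award_nominated_work"},
    "a_third_film": {"film.film", "award.award_nominated_work", "tv.tv_program"},
    "some_book": {"award.award_nominated_work"},
}

exact, cumulative = cluster_sizes(entities)
film = frozenset({"film.film"})
print(exact[film], cumulative[film])  # 1 3
```

Because clusters are keyed by sets rather than sequences, the order in which types are listed plays no role in the counts.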
The Data Types Explorer, which we include in this distribution, allows you to explore all this. In its tabular form it looks like this:
The tree starts from the most popular types (by cumulative size, which is the most indicative number) and allows drilling down to clusters of entities which have more and more types at the same time.
As the order of the types does not matter, the same numbers can be seen going from different "roots", e.g., one would encounter the same numbers starting from tv.tv_series_episode and then film.film, or following the opposite route.
Note that the Explorer also has a search feature, which allows you to find clusters which contain the specified string in the name of any of the contained types.
The Data Type Explorer - Graph Version
The same data type hierarchy can be accessed in graphical form, which gives further hints on how Freebase data is organized.
Here the hierarchy is shown with bigger dots being the single-class clusters and weights being based on the cumulative counts.
Note that in order to constrain the complexity of the visualization, this view cuts off the smaller clusters, so part of the data in smaller clusters, or in clusters that are not connected, will be unaccounted for.
The contribution of this visualization is that it visually clusters together types that have "something to do" with each other. To see this, try "film" or "track" or "politician" or "city" as an input in the search box to see the relevant part of the class "hierarchy" highlighted.
Querying Freebase with the assisted SPARQL Query editor
The distribution comes with a preloaded Virtuoso triple store which you can begin querying straight away using the SPARQL query language. For those not familiar with it, SPARQL is a mature query language which is specialized in querying Knowledge Graphs expressed in RDF.
Writing SPARQL queries, however, is often difficult: for example, classes need to be specified with their full URIs, and it is not obvious which properties apply to which class.
For this reason, this distribution includes an assisted SPARQL query editor (SparqlEd) which uses the summary graph to provide auto-completion and suggestions.
The suggestions are activated by pressing CTRL + Space (Windows/Linux) or CTRL + tilde (Mac).
The suggestions use the context of the query to recommend only properties which are related to the types in the current scope. The editor does not enforce these suggestions, so the user remains free to edit the query at will.
As an example and exercise, try composing the query below:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
Note: direct access to the Virtuoso SPARQL endpoint (no auto-suggestions) can still be performed by accessing http://your.google.cloud.ip/sparql/.
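For scripted access to the endpoint, a minimal sketch using only the Python standard library might look like the following. It assumes the endpoint follows the standard SPARQL Protocol (a `query` parameter on a GET request, with JSON results selected through the `Accept` header); the host is the placeholder from the note above, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Placeholder host taken from the note above; replace with your VM's address.
ENDPOINT = "http://your.google.cloud.ip/sparql/"


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT, timeout=60):
    """Send the query and return the list of result bindings (rows)."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


# A deliberately tiny query: fetch any ten triples.
SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

request = build_request(SAMPLE_QUERY)
print(request.full_url)
```

Calling `run_query(SAMPLE_QUERY)` against a running instance would return one dictionary per result row; nothing is sent over the network until that call is made.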
A demo of SparqlEd in action can be seen in our video interview.
The Data Graph Summary
If you have a SPARQL endpoint, you can in theory get these answers simply by running analytic queries. When graphs are very large, however, it is very handy to have all this precomputed.
In this distribution we include not only Freebase data, but also an additional data graph called "Data Graph Summary", which we compute offline using Hadoop. For more details on this, see the later section on "Querying the Data Graph Summary".
The summary graph is the key to the following exploration tools that are in the distribution.
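As a rough idea of what such an analytic query looks like, the sketch below builds a plain SPARQL 1.1 aggregate that counts instances per type. The helper name `type_count_query` and its parameters are hypothetical, and the graph URI is left to the caller rather than assumed. On a dataset the size of Freebase a query like this can be slow, which is the motivation for precomputing.

```python
def type_count_query(graph_uri=None, limit=20):
    """Build a SPARQL 1.1 query that counts distinct instances per rdf:type.

    graph_uri: optional named graph to restrict the query to (FROM clause).
    limit:     maximum number of types to return, most populated first.
    """
    from_clause = f"FROM <{graph_uri}>\n" if graph_uri else ""
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "SELECT ?type (COUNT(DISTINCT ?entity) AS ?instances)\n"
        f"{from_clause}"
        "WHERE { ?entity rdf:type ?type }\n"
        "GROUP BY ?type\n"
        "ORDER BY DESC(?instances)\n"
        f"LIMIT {int(limit)}\n"
    )


# Print the query text so it can be pasted into the endpoint or the editor.
print(type_count_query(limit=5))
```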
At the heart of the capabilities of the tools in this distribution lies the Data Graph Summary, an extra RDF graph which is computed from the Freebase data and then loaded "next to it".
Querying the Data Graph Summary
Triplestores support multiple "named graphs", that is, individual graphs made of triples, each identified by a URI. In this distribution the data itself is stored in the graph with URI:
On the other hand, the summary graph is stored in the Virtuoso store as a named graph with URI:
The summary graph is based on the idea of clusters of instances (called ?node in the query below). For the simplest type of summary that our Knowledge Graph Analytics produces, these clusters are given the label of the type.
We can query the summary graph to get the counts of the clusters of a particular type (people.person) and sort the results by their counts.
prefix a: <http://vocab.sindice.net/analytics#>
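If the rows of such a count query are fetched as standard SPARQL JSON results, the sorting can also be done or checked on the client side. The sketch below is an illustration only: the variable name `count`, the example.org URIs, and the numbers are made up, and only `?node` comes from the text above.

```python
def sort_by_count(bindings, count_var="count"):
    """Sort SPARQL JSON result rows by an integer count, largest first."""
    return sorted(
        bindings, key=lambda row: int(row[count_var]["value"]), reverse=True
    )


XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Made-up rows in the W3C SPARQL JSON results layout (one dict per row).
sample_rows = [
    {"node": {"type": "uri", "value": "http://example.org/cluster/a"},
     "count": {"type": "literal", "datatype": XSD_INTEGER, "value": "12"}},
    {"node": {"type": "uri", "value": "http://example.org/cluster/b"},
     "count": {"type": "literal", "datatype": XSD_INTEGER, "value": "340"}},
    {"node": {"type": "uri", "value": "http://example.org/cluster/c"},
     "count": {"type": "literal", "datatype": XSD_INTEGER, "value": "7"}},
]

ranked = sort_by_count(sample_rows)
for row in ranked:
    print(row["node"]["value"], row["count"]["value"])
```

Counts arrive as strings in this format, so they are converted to integers before comparison; sorting the raw strings would place "7" ahead of "340".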
Built with the support of the Google Developer Relations team. Specific thanks go to Jarek Wilkiewicz, Shawn Simister, and Dan Brickley. Thanks also go to the OpenLink Software Virtuoso team for the VM parameter optimizations.
Q) Can I have the distribution on a platform other than Google Cloud?
A) We're considering offering the same distribution on Amazon or other platforms. Please contact us if this is of high importance to you. If you want to move the data manually to any other computer, you can of course do so; these notes support the task.
More questions and Support
Having trouble with the demos? For public questions please post your question in the group.
For specific focused help, please contact us and we'll help you sort it out.
Release Date: 6-2-2014
Tools included in this version:
On the Google Developer Channel, a 20m conversation covering: