The delicate dance of
decentralization and aggregation

Ruben Verborgh, Ghent University – imec

International Conference of the European Library Automation Group (ELAG), 5 June 2018


©2017 Peter Forret
©2008 Hitchster

Why are we publishing
Linked Data again?


Aggregators merge multiple collections
into a single centralized view.

Are we publishing Linked Data
only for the happy few?

Decentralization can be realized
at very different scales.

Every piece of data in decentralized apps
can come from a different place.

Solid is an application platform for
decentralization through Linked Data.

The creator represents each data item
as Linked Data.

PREFIX as: <https://www.w3.org/ns/activitystreams#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<#elag2018> a as:Like;
  as:actor  <https://ruben.verborgh.org/profile/#me>;
  as:object <https://www.elag2018.org/#conference>;
  as:published "2018-06-05T07:00:00Z"^^xsd:dateTime.

Others can learn about my data
through Linked Data Notifications.

PREFIX as: <https://www.w3.org/ns/activitystreams#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

_:announce a as:Announce;
  as:actor  <https://ruben.verborgh.org/profile/#me>;
  as:object <https://drive.verborgh.org/likes/#elag2018>;
  as:target <https://www.elag2018.org/#conference>;
  as:updated "2018-06-05T07:00:00Z"^^xsd:dateTime.

A Linked Data Notification is
posted to your inbox.

POST /inbox/ HTTP/1.1
Host: www.elag2018.org
Content-Type: text/turtle

[Turtle body: the announcement above]

HTTP/1.1 201 Created
Location: https://www.elag2018.org/inbox/3679efc35

If I place a comment,
you can choose to link back.

Multiple decentralized Web apps
share access to data stores.

When visiting an application,
you log in with an external identity.
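In code, such an external login could look as follows (a minimal sketch based on the solid-auth-client library; the exact API is an assumption here, and the identity provider URL is the user's choice, not the app's):

import auth from 'solid-auth-client';

async function ensureLoggedIn(identityProvider) {
  // Reuse an existing session if the user is already logged in
  const session = await auth.currentSession();
  if (!session)
    // Redirects to the user's own identity provider to authenticate
    await auth.login(identityProvider);
  else
    console.log(`Logged in as ${session.webId}`);
}

// Hypothetical example provider; any compatible provider works
ensureLoggedIn('https://solid.community');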

Different app and storage providers
compete independently.

Hard-coded client–server contracts are
unsustainable with multiple sources.

We have to manually contact sources
and hope that contracts don’t change.

const me = 'https://ruben.verborgh.org/profile/#me';

async function findLikesOfFriends() {
  // Fetch and parse my profile document
  const profile = await fetch(me);
  const triples = parseTurtle(await profile.text());
  // Extract the URLs of my friends' profile documents
  const friends = triples.findObjects(me, 'foaf:knows');

  // await is only valid inside an async function, so loop with for…of
  for (const friend of friends) {
    const profile = await fetch(friend);
    const triples = parseTurtle(await profile.text());
    // now find likes
  }
}

Query-based contracts can make
decentralized Web apps more sustainable.

This query captures the same intent
independently of sources and interfaces.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX as: <https://www.w3.org/ns/activitystreams#>

SELECT ?friend ?like WHERE {
  <https://ruben.verborgh.org/profile/#me> foaf:knows ?friend.
  _:like a as:Like;
    as:actor ?friend;
    as:object ?like.
}

We are building the Comunica engine
to support such query scenarios.
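A minimal sketch of how an app could execute the above query with Comunica (assuming the @comunica/actor-init-sparql package; API details may differ between versions):

import { newEngine } from '@comunica/actor-init-sparql';

async function showLikesOfFriends() {
  const engine = newEngine();
  // The query is the contract; sources can change without touching it
  const result = await engine.query(`
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX as: <https://www.w3.org/ns/activitystreams#>
    SELECT ?friend ?like WHERE {
      <https://ruben.verborgh.org/profile/#me> foaf:knows ?friend.
      _:like a as:Like; as:actor ?friend; as:object ?like.
    }`,
    { sources: [{ type: 'file', value: 'https://ruben.verborgh.org/profile/' }] });

  // Results arrive as a stream of variable bindings
  result.bindingsStream.on('data', bindings =>
    console.log(`${bindings.get('?friend').value} likes ${bindings.get('?like').value}`));
}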

The story of a small metadata publisher:
my scholarly publications.

I have been publishing my own metadata
since before most of these aggregators existed.

Why spend time on this while
the aggregators already do?

I want to be the source of truth.
I don’t need to be the only source.

My personal website contains metadata
about my research and publications.

This includes metadata for:

My data is published following
the Linked Data principles.
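For example, any of these URLs can be dereferenced with content negotiation; requesting my profile document in Turtle might look like this (headers abbreviated):

GET /profile/ HTTP/1.1
Host: ruben.verborgh.org
Accept: text/turtle

HTTP/1.1 200 OK
Content-Type: text/turtle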

My data is modeled using several
ontologies and vocabularies.
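As an illustration, a single publication entry might combine several of these vocabularies (a hypothetical sketch; the identifiers and the exact dc namespace are assumptions):

PREFIX schema: <http://schema.org/>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX cito: <http://purl.org/spar/cito/>

<#article> a schema:ScholarlyArticle;
  dc:title "Triple Pattern Fragments";
  schema:author <https://ruben.verborgh.org/profile/#me>;
  schema:isPartOf <#journal>;
  cito:cites <#other-article>.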

I publish my own Linked Data because
we need to practice what we preach.

But who am I generating this data for?

The value of my Linked Data
needs to be unlocked.

I want to:

Traversal-based Linked Data querying
cannot answer all questions adequately.

Open questions about
modeling Linked Data:

Be conservative in what you send,
but liberal in what you accept.

Solving querying fully at the server side
is too expensive for personal data.

I designed a simple ETL pipeline
to enrich and publish my website’s data.

This process runs every night:

- extract the RDF embedded in my website's pages
- enrich the data through reasoning
- publish the result in a queryable interface

Reasoning on the data and its ontologies
makes hidden semantics explicit.

We use forward-chaining reasoning
in a careful multi-step process.

Reasoning expresses the same data
in different ways for different clients.

step                        time (s)    # triples
extraction                       170       17,000
skolemization ontologies           1       44,000
closure ontologies                39      145,000
closure ontologies & data         62      183,000
subtraction                        1       39,000
removal                            1       36,000
total                            273       36,000

Reasoning fills ontological gaps
before querying happens.

property                 # pre    # post
dc:title                   657       714
rdfs:label                 473       714
foaf:name                  394       714
schema:name                439       714
schema:isPartOf            263       263
schema:hasPart               0       263
cito:cites                   0        33
cito:citesAsAuthority       14        14
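The jump of schema:hasPart from 0 to 263, for instance, comes from its inverse relation with schema:isPartOf; as an N3 rule, the derivation looks roughly like this (a sketch; the actual pipeline obtains such inferences from the loaded ontologies):

@prefix schema: <http://schema.org/>.

# If a resource is part of a whole, derive the inverse triple as well.
{ ?part schema:isPartOf ?whole. } => { ?whole schema:hasPart ?part. }.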

The resulting data is published
in a Triple Pattern Fragments interface.
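A TPF interface lets clients request all triples matching a single pattern over plain HTTP; a typical request could look like this (the host and URL template are hypothetical; clients discover the actual controls via hypermedia):

GET /dataset?predicate=http%3A%2F%2Fpurl.org%2Fspar%2Fcito%2Fcites HTTP/1.1
Host: data.example.org
Accept: text/turtle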

This is my personal data
in a Triple Pattern Fragments interface.

What scholarly articles did I write
about Linked Data Fragments?

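Expressed as a query over my TPF interface, that question might look roughly like this (a sketch; the concrete predicates for topics are assumptions):

PREFIX schema: <http://schema.org/>

SELECT ?article WHERE {
  ?article a schema:ScholarlyArticle;
    schema:author <https://ruben.verborgh.org/profile/#me>;
    schema:about ?topic.
  ?topic schema:name "Linked Data Fragments".
}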

What scholarly articles do I cite
from within my publications?

What scholarly articles
do I agree with?

TPF query clients find all results,
and find them faster than traversal-based clients.

query                       # results       time (s)
                             LD    TPF      LD    TPF
people I know                 0    196     5.6    2.1
publications I wrote          0    205    10.8    4.0
my publications             134    205    12.6    4.1
works I cite                  0     33     4.0    0.5
my interests (federated)      0      4     4.0    0.4

(LD = traversal-based Linked Data client; TPF = Triple Pattern Fragments client)

What topics am I interested in,
and what are their definitions?

What books by authors from Prague
are in Harvard Library?
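A sketch of that federated query, joining DBpedia with Harvard Library's catalog (the exact predicates and the join on names are assumptions):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?book ?title WHERE {
  ?author dbo:birthPlace dbr:Prague;     # from DBpedia
          foaf:name ?name.
  ?book dc:creator [ foaf:name ?name ];  # from the Harvard Library data
        dc:title ?title.
}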

The Paradox of Freedom:
you can only be free if you follow rules.

We need to identify those rules
we all need to agree on.

Lessons learned from aggregating hundreds of datasets
are highly useful for informing this discussion.

Fortunately, this is where
the semantics of Linked Data shine.

Our metadata starts decentralized.
Why do we centralize via aggregation?

What gets lost in translation
when data is aggregated?

What flows back to data producers
as a return from aggregators?

Imagine all sorts of feedback
we are missing out on.

This knowledge lets (only) you improve your data
and the experience of those who eventually use it.

Decentralization needs replication
for realistic performance.

In addition to technological changes,
we need a shift of mindset.

Current networks are centered
around the aggregator.

We need to create network flows
to and from the aggregator.

The individual network nodes
need to become the source of truth.

Aggregators need to become part
of a larger network.

Aggregators serve as a crucial
but transparent layer in the network.

Aggregators’ main responsibility becomes
fostering a network between nodes.

©2012 Adriano Miranda Vasconcellos de Jesus
©2012 Bernd Thaller
©2012 US Embassy Jerusalem

The delicate dance of
decentralization and aggregation

@RubenVerborgh

https://ruben.verborgh.org/