This document identifies problems with SPARQL EXISTS and describes two proposals.
Draft.
The evaluation process in the specificiation is defined for graph patterns but there are situations where the evaluation is of an alegbra form not listed.
For example:
FILTER EXISTS { SELECT ?y { ?y :q :c . } }and
FILTER EXISTS { VALUES ?y { 123 } }
The argument to
exists
is not explicitly listed as a "Graph Pattern" in the table of SPARQL algebra symbols in
section 18.2
when the argument to EXISTS
is a
GroupGraphPattern
containing just a
subquery
or just
InlineData.
There are places in the SPARQL syntax and algebra where variables are allowed but not RDF terms (constant values).
Example:
FILTER EXISTS { BIND ( :e AS ?z ) { SELECT ?x { :b :p :c } } } }
Both positions "AS ?z" and "SELECT ?x" must be variables.
In the algebra, this affects
extend
(related to the use of AS
in SPARQL syntax)VALUES
)BOUND
In the evaluation of basic graph patterns (BGPs) blank nodes are replaced by RDF terms from the graph being matched and variables are replaced by a solution mapping from query variables to RDF terms so that the basic graph pattern is now a subgraph of the graph being matched.
Simply substituting a variable with a blank node in the EXISTS
evaluation process does not cause the basic graph pattern to be
to be restricted to subgraphs containing that blank node as an RDF term
because it is mapped by an
RDF instance
mapping before checking that the BGP after mapping is a subgraph of the
graph being queried.
Note that elsewhere in the evaluation of the SPARQL algebra, a solution mapping with a binding from variable to blank node, does treat blank nodes as RDF terms. They are not mapped by an RDF instance mapping.
Example:
SELECT ?x WHERE { ?x :p :d . FILTER EXISTS { ?x :q :b . } }
against the graph { _:c :p :d , :e :q :b }
the substitution for EXISTS
produces
BGP(_:c :q :b) which then
matches against :e :q :b because the _:c can be mapped to :e by
the RDF instance mapping that is part of pattern instance mappings in
18.3.1.
In SELECT ?x WHERE { ?x :p :c . FILTER EXISTS { ?x :p :c . MINUS { ?x :p :c . } } } on the graph { :d :p :c } the substitution from 18.6 ends up producing Minus( BGP( :d :p :c ), BGP( :d :p :c ) ) which produces a non-empty result because the two solution mappings for the Minus have disjoint domains and 18.5 dictates that then the result is not empty.
In SELECT ?x WHERE { BIND ( :d AS ?x ) FILTER EXISTS { BIND ( :e AS ?z ) SELECT ?y WHERE { ?x :p :c } } } the substitution from 18.6 ends up producing Join ( Extend( BGP(), ?z :e ), ToMultiSet( Project( ToList( BGP( :d :p :c ) ), { ?y } ) ) ) Some, but not all, implementations diverge from the spec here.
All filtering in SPARQL is determining whether a solution mapping passes some condition. We call this solution mapping the current row in this description. We call the translation to the SPARQL algebra of "pattern", the EXISTS pattern.
Evaluation of the EXISTS
function is defined by the process of
substitution
appiled to the EXISTS pattern, which is then evaluated. The EXISTS filter
expression is true if the evaluation results in one or more solution mappings.
This is a section proposes an alternative mechanism. Rather than replace each
variable by the value it is bound to in the current row, this alternative
mechanism makes the whole of the current row available at any point in the
evaluation of an EXISTS
expression. It uses the current row to
restrict the binding of variables at the points where variable bindings are
created during evaluation of EXISTS
to be those from the current row.
It makes illegal syntactic constructs that could lead to an attempt to rebind a
variable from the current row through using the AS
syntax.
Section 2.2 describes how this alternative definition of
EXISTS
addresses each of the issues identified in
Identifed Issues section.
There are 3 parts to the proposal:
EXISTS
. This is the
replacement for substitution; evaluation that binds a variable proceeds through
graph pattern matching but the values it can take are immediately restricted to
that in the current row.
Within sub-queries, variables with the same name can be used but do not appear in the overall results of the query if they do not occur in the projection in the sub-query. Such inner variables are not in-scope when they are not mentioned in the projection part of the inner SELECT expression.
SELECT * { ?s :value ?v . FILTER EXISTS { {SELECT (count(*) AS ?C) { ?s :property ?w . }} FILTER ( ?C < ?v ) } }
Here, the ?s is not mentioned in the projection in SELECT (count(*) AS ?C). Replacing ?s by, for example, ?V1234 in the sub-query does not change the overall results.
SELECT * { ?s :value ?v . FILTER EXISTS { {SELECT (count(*) AS ?C) { ?V1234 :property ?w . }} FILTER ( ?C < ?v ) } }
Such variable usages can be replaced with a variable of a different name, if that name is not used anywhere else in the query, and the same results are obtained in the sub-query. A sub-query always has a projection as its top-most algebra operator.
To preserve this, any such variables are renamed so they do not coincide with
variables from the current row being filter by EXISTS
.
The SPARQL algebra "project" operator has two components, an algebra expression and a set of variables for the projection.
For a projection algebra operation P with set of variables PV, define a partial mapping F from V, the set of all variables, to V where:
F(v) = v if v in PV
F(v) = v1 where v is a variable mentioned in the project expression, but not in PV, and v1 is a fresh variable
F(v) = v otherwise.
Define the Projection Expression Variable Remapping PrjMap(P,PV) to be the algebra expression P (and the subtree over which the projection is defined) with F applied to every variable of the algebra expression P over which P is evaluated.
This process is applied throughout the graph pattern of EXISTS
:
For any algebra expression X define the Variable Remapping PrjMap(X):
PrjMap(X) = replace all project operations project(P PV) with project(PrjMap(P,PV) PV) for each projection in X.
This replacement is applied bottom-up when there are multiple project
operations in the graph pattern of EXISTS
.
Applying the renaming steps inside a sub-query does not change the solution mappings resulting from evaluating the sub-query. Remapping is only applied to variables not visible outside the sub-query. Renaming a variable in a SPARQL algebra expression causes the variable name used in bindings from evaluating the algebra expression to change. Since these are only variables that are not visible outside the sub-query, because they do not occur in the projection, the result of the sub-query is unchanged. SPARQL algebra expressions can not access the name of a variable nor introduce a variable except by remapping. Remapping is only applied to variables not visible outside the sub-query.
SPARQL syntactic forms that attempt to bind a variable through the use of
AS
that might already be in a solution mapping are forbidden in
SPARQL: this is covered in the syntactic restrictions of
19.8 Grammar,
notes 12 and 13.
This proposal adds the restriction that any variables in a current
row, the set of variables
in-scope
of the FILTER containing EXISTS, can not be assigned with the extend
algebra function linked to the AS
syntax.
In addition, any use of VALUES
in the EXISTS expression must not
use a variable in the current row.
The proposal is to retain the variables from the current row, not substitute them for RDF terms, before evaluation, and also to restrict the binding of the solution to the RDF term of the current row. This occurs after renaming.
Binding for variables occur in several places in SPARQL:
Graph(var,P)
involving a variable (from the syntax GRAPH ?variable {...}
)
Note that other places where solution mappings add variables are in
extend
function (connected to the AS
syntax)
and a multiset
from VALUES
syntax.
Limitations on Assignment
forbid this being of variables of the current row.
Restricting the RDF Terms for a variable binding is done using inline data that is joined with the evalaution of the basic graph pattern, property path or graph match.
For solution mapping μ, define Table(μ) to be the multiset formed from μ.
Table(μ) = { μ }
Card[μ] = 1
Define the Values Insertion function Replace(X, μ) to replace each occurence Y of a Basic Graph Pattern, Property Path Expression, Graph(Var, pattern) in X with join(Y, Table(μ)).
The evaluation of EXISTS
is defined as:
Let μ be the current solution mapping for a filter and X a graph pattern, define the Evaluation of Exists exists(X)
exists(X) = true if and only if eval(D(G), Replace(PrjMap(X), μ) is a non-empty solution sequence.
exists(X) = false otherwise
This section addresses each issue identified, given the proposal above.
This can be handled by handling solution sequences as graph patterns where needed by adding toMultiSet as is done for SubSelect in 18.2.2.6 Translate Graph Patterns with a a correction to the text at the end of Section 18.2 introductory paragraph.
query-errata-N: "Section 18.2 Translation to the SPARQL Algebra" intro (end): ToMultiSet can be used where a graph pattern is mentioned below because the outcome of evaluating a graph pattern is a multiset. Multisets of solution mappings are elements of the SPARQL algebra. Multisets of solution mappings count as graph patterns.
Rather then replace a varialbe by its value in the current row, the new
mechanism makes the binding of variable to value available. The variable
remains in the graph pattern of EXISTS
and the evaluation.
By making the current row, which can include blank nodes, available, and not modifying the BGP by substitution, no blank nodes are introduced into the evalaution of the BGP. Instead, the possible solutions is restricted by the current row.
Issue 4 is addressed because variablea re not removed from the domain of
MINUS
. This propsoal does not preserve all uses MINUS
expressions; the problem identified in issue 4 is considered to be a bug in the original
SPARQL specification.
Issue 5 is addressed by noting that variables inside sub-queries which are not projected can be renamed without affecting the sub-query results. Whether to preserve that invariant or allow the variables to be set by the current row is a choice point - this design preserves the independence of disconnected variables.
The proposal described in this document does not cover use of variables from the current row in a HAVING clause.
This proposal for EXISTS
emphasizes simplicity and
the SPARQL notions of variable scoping and bottom-up evaluation of sub-queries over
maximum compatability with the current SPARQL definition
for EXISTS
.
Its basic idea is to inject values for the variables in-scope just
outside a FILTER
expression at the beginning of the
pattern argument to EXISTS
almost as if
a VALUES
construct was injected there.
FILTER
are in-scope at the beginning of the pattern argument to
any EXISTS
in the FILTER
expression. (This is
independent of the change to a mapping-based definition but fixes an
error that affects EXISTS
.) The additional scoping rule could
be written as:FILTER ... EXISTS { P } ...
with P
not a sub-select,
if v
is in-scope from the preceeding elements in the group graph
pattern in which the FILTER
is used then v
is in-scope for an empty
BGP that is considered to be inserted just before P
.
Initial
, to the SPARQL syntax and
algebra. Initial
will be used to set up the initial multiset of solution
mappings inside an EXISTS
. It will work much like VALUES
except
that it will transfer solution mappings through the EXISTS
instead of setting up a constant solution mapping.
FILTER
elements replace
EXISTS P
in the filter expression with
exists(Initial(t),translate(Initial(t) P'))
where t
is a fresh token, and similarly for NOT
EXISTS {P}
. If P
is
a SubSelect then P'
is {P}
otherwise P'
is just P
.
Initial(t)
as itself. exists
function to: μ
be the current solution mapping for a filter, t
a token,
and P
a graph pattern: The value exists(Initial(t),P)
given D(G)
is true iff
eval(D(G),P')
is a non-empty multiset of solution
bindings, where P' is P
with
Initial(t)
replaced by {μ}
.
The same solution as for Proposal B can be used here to solve Issue 1.
There is no substitution of variables by values (including blank nodes) thus solving Issues 2, 3, and 4.
As the value-injection is done only at the beginning of the
argument to EXISTS
it will not affect disconnected
variables, such solving Issue 5.
This section records other possible errata discovered by the community group.
Near the end of section "18.2.2.6 Translate Graph" the translation of inline data should result in a ToMultiSet(data).
Translation of a trailing VALUES already uses this form.
For each DataBlock, form a solution mapping from the variable in the corresponding position in list of variables (or single variable), omitting a binding if the BindingValue is the word UNDEF.
Combine each solution mappings into a solution sequence S, and apply ToMultiSet.
The result is ToMultiset(S)
The text in 18.2.4.3 then needs to be aligned: ToMultiSet(data) becomes translate(data).
This proposal does not introduce any additional security or privacy considerations to SPARQL 1.1.