The Open Anzo Project

Semantic Application Middleware

Ticket #322 (closed defect: fixed)

Opened 8 months ago

Last modified 3 months ago

SPARQLAlgebra.join returns incorrect results when Rdb and Mem rdf node types are mixed.

Reported by: jpbetz Assigned to: jpbetz
Priority: major Milestone: 3.0 milestone2
Component: openanzo Version: 3.0
Keywords: hashCode equals compareTo Cc: lee mroy

Description (Last modified by jpbetz)

This is particularly difficult to replicate, but I will try to post something shortly.

This was exposed when a SolutionSet containing some MemTypedLiterals was joined with RdbNodes of various types. The results of the join were clearly wrong, and eventually it became clear that the issue occurred only when typed literals were in the mem solution set.

The culprits are the compareTo methods in the RdbLiteral and RdbURI node classes, which compare by ID against each other rather than by lexical value. In isolation, these node classes seem to establish some kind of consistent order and can be joined against each other quite well, and very fast.

Unfortunately, this internal ordering doesn't seem to extend fully to mem nodes, particularly literal datatypes (blank nodes seem to have issues as well).
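To make the failure mode concrete, here is a minimal sketch. The classes DemoNode, DemoMemNode, DemoRdbNode and MixedOrderDemo are hypothetical stand-ins, not the actual Open Anzo node classes, and the sketch assumes that Rdb nodes compare by ID against each other while every other pairing falls back to the lexical value. Under that assumption three nodes are enough to produce an ordering cycle, which any sort- or comparison-based join cannot tolerate.

```java
import java.util.Comparator;

// Hypothetical stand-ins for the mem and rdb node classes (names are assumptions).
interface DemoNode {
    String lexical();
}

final class DemoMemNode implements DemoNode {
    private final String lexical;
    DemoMemNode(String lexical) { this.lexical = lexical; }
    public String lexical() { return lexical; }
}

final class DemoRdbNode implements DemoNode {
    final long id;                 // database sequence ID, unrelated to lexical order
    private final String lexical;
    DemoRdbNode(long id, String lexical) { this.id = id; this.lexical = lexical; }
    public String lexical() { return lexical; }
}

final class MixedOrderDemo {
    // Assumed model of the problematic behavior: ID order for rdb/rdb pairs,
    // lexical order for every pair that involves a mem node.
    static final Comparator<DemoNode> BY_ID_WHEN_POSSIBLE = (a, b) -> {
        if (a instanceof DemoRdbNode && b instanceof DemoRdbNode) {
            return Long.compare(((DemoRdbNode) a).id, ((DemoRdbNode) b).id);
        }
        return a.lexical().compareTo(b.lexical());
    };

    // Purely lexical order, which is consistent across all node kinds.
    static final Comparator<DemoNode> BY_LEXICAL = Comparator.comparing(DemoNode::lexical);

    public static void main(String[] args) {
        DemoNode r1 = new DemoRdbNode(1, "c");
        DemoNode r2 = new DemoRdbNode(2, "a");
        DemoNode m = new DemoMemNode("b");
        // Each comparison says "left sorts before right", so together they form
        // the cycle r1 < r2 < m < r1.
        System.out.println(Integer.signum(BY_ID_WHEN_POSSIBLE.compare(r1, r2)));
        System.out.println(Integer.signum(BY_ID_WHEN_POSSIBLE.compare(r2, m)));
        System.out.println(Integer.signum(BY_ID_WHEN_POSSIBLE.compare(m, r1)));
    }
}
```

With BY_LEXICAL the same three nodes sort as r2, m, r1 with no cycle, at the cost of comparing strings instead of longs.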

To verify this theory, I have commented out (but not checked in) the 'by ID' portions of the compareTo methods in the RDB node classes. This forces the implementations to fall back to performing all these operations on the lexical values of the nodes. This does in fact result in correct join behavior. However, since there is a huge performance benefit to using the IDs for comparisons, I think it's clear that this issue is best resolved by fixing the ID scheme to handle literal datatype (and presumably lang) tags, as well as verifying that blank node labels are working okay. (I do not yet know how the IDs are selected or whether this is possible, but it does _seem_ to _almost_ work.)

This is not an entirely unknown issue. From RdbLiteral.equals:

        // via the definition in:
        // http://www.w3.org/TR/rdf-concepts/#section-Literal-Equality
        // we need only compare the lexicalizations of the two literals
        // since the lexicalizations include the language tag and datatype
        // parts

Note that this is a major issue, since SPARQL queries behave erratically whenever some of the RDF is not backed by RDB nodes.

Change History

05/05/08 03:01:54 changed by jpbetz

  • description changed.

05/05/08 03:04:27 changed by jpbetz

  • description changed.

05/05/08 03:11:48 changed by jpbetz

  • description changed.

05/05/08 08:52:00 changed by mroy

The byId methods will only work if all RDB data comes from a single NodeCentric datasource, since the ID is the Value's sequence-based NodeCentric ID. Comparing IDs from two datasources would yield inconsistent results, since IDs are not shared between datasources. Comparing non-RDB values to RDB values would most likely yield inconsistent results as well, since some values would sort/compare on IDs while others would sort/compare on toString. In situations where data can come from mixed datasources, comparison based on ID is probably not possible.

05/05/08 11:25:06 changed by jpbetz

Okay, so as things stand, the equals, hashCode and natural compareTo in the RDB nodes are unsafe and must be stripped out.

We should still provide for an efficient byID comparator for joins when it's known that all the nodes in the join are Rdb nodes.
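A rough sketch of what such an opt-in comparator could look like. All names here (JoinNode, JoinRdbNode, JoinMemNode, JoinComparators) are hypothetical and not part of the Open Anzo API. The sketch also assumes every Rdb node comes from the same datasource, since IDs are not comparable across datasources.

```java
import java.util.Collection;
import java.util.Comparator;

// Hypothetical stand-ins; the real node classes will differ.
interface JoinNode {
    String lexical();
}

final class JoinRdbNode implements JoinNode {
    final long id;
    private final String lexical;
    JoinRdbNode(long id, String lexical) { this.id = id; this.lexical = lexical; }
    public String lexical() { return lexical; }
}

final class JoinMemNode implements JoinNode {
    private final String lexical;
    JoinMemNode(String lexical) { this.lexical = lexical; }
    public String lexical() { return lexical; }
}

final class JoinComparators {
    // Safe for any mix of node types.
    static final Comparator<JoinNode> BY_LEXICAL = Comparator.comparing(JoinNode::lexical);

    // Fast, but only valid when every node is a JoinRdbNode from one datasource.
    // Throws ClassCastException if a mem node slips in.
    static final Comparator<JoinNode> BY_ID =
            Comparator.comparingLong(n -> ((JoinRdbNode) n).id);

    // Picks the by-ID comparator only when both join inputs are entirely rdb-backed.
    static Comparator<JoinNode> choose(Collection<? extends JoinNode> left,
                                       Collection<? extends JoinNode> right) {
        boolean allRdb = left.stream().allMatch(n -> n instanceof JoinRdbNode)
                && right.stream().allMatch(n -> n instanceof JoinRdbNode);
        return allRdb ? BY_ID : BY_LEXICAL;
    }
}
```

The join code then takes the comparator as a parameter instead of relying on the nodes' natural ordering, so the natural ordering can stay purely lexical.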

For the mixed join case, we may benefit from batch loading Rdb values. I'm still working on what our optimization options are here.

05/05/08 11:35:34 changed by jpbetz

By stripped out I mean the equals, hashCode and compareTo of RDB nodes cannot use node ID but must instead always use lexical value.
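As an illustration only, the lexical-only contract could look like the following. LexicalNode, LexMemLiteral and LexRdbLiteral are made-up names, and the way the lexical form is assembled from label and datatype is an assumption rather than the actual Open Anzo serialization.

```java
// Hypothetical base class: equals, hashCode and compareTo all derive from the
// lexical form, so mem and rdb nodes agree with each other.
abstract class LexicalNode implements Comparable<LexicalNode> {
    abstract String lexical();

    @Override
    public final boolean equals(Object o) {
        return o instanceof LexicalNode && lexical().equals(((LexicalNode) o).lexical());
    }

    @Override
    public final int hashCode() {
        return lexical().hashCode();
    }

    @Override
    public final int compareTo(LexicalNode other) {
        return lexical().compareTo(other.lexical());
    }
}

final class LexMemLiteral extends LexicalNode {
    private final String label;
    private final String datatype;
    LexMemLiteral(String label, String datatype) {
        this.label = label;
        this.datatype = datatype;
    }
    // Assumed lexical form: the datatype is part of the string, so literals
    // with the same label but different datatypes are not equal.
    @Override
    String lexical() { return "\"" + label + "\"^^<" + datatype + ">"; }
}

final class LexRdbLiteral extends LexicalNode {
    final long id;                 // kept for database access, ignored by equals/hashCode/compareTo
    private final String lexical;
    LexRdbLiteral(long id, String lexical) {
        this.id = id;
        this.lexical = lexical;
    }
    @Override
    String lexical() { return lexical; }
}
```

Because the ID plays no part in any of the three methods, an rdb literal and a mem literal with the same lexical form land in the same hash bucket and compare as equal.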

05/05/08 12:24:18 changed by jpbetz

The mixed join case has some potential for optimization:

1. If the ratio of mem nodes to rdb nodes is very small, it would be more efficient to get or create rdb nodes to match these mem nodes and use them in a byID join.

2. If the ratio of mem nodes to rdb nodes is very large, it would be more efficient to get the lexical values of the rdb nodes and use them in a lexical join.
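The two cases above could be sketched as a simple planner. MixedJoinStrategy, MixedJoinPlanner and the single cut-over threshold are all assumptions for illustration; no actual ratio has been measured, and the real decision may need more than one threshold.

```java
// Hypothetical strategy names for the two cases described above.
enum MixedJoinStrategy {
    BY_ID,     // get or create rdb nodes for the mem nodes, then join on IDs
    LEXICAL    // fetch lexical values for the rdb nodes, then join on strings
}

final class MixedJoinPlanner {
    // Assumed cut-over ratio of mem nodes to rdb nodes; needs tuning by measurement.
    static final double MEM_TO_RDB_THRESHOLD = 0.1;

    static MixedJoinStrategy choose(int memCount, int rdbCount) {
        if (memCount < 0 || rdbCount < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        if (memCount == 0) {
            return MixedJoinStrategy.BY_ID;      // nothing to convert, IDs are safe
        }
        if (rdbCount == 0) {
            return MixedJoinStrategy.LEXICAL;    // no IDs available at all
        }
        double ratio = (double) memCount / rdbCount;
        return ratio <= MEM_TO_RDB_THRESHOLD
                ? MixedJoinStrategy.BY_ID
                : MixedJoinStrategy.LEXICAL;
    }
}
```

For example, 5 mem nodes against 10,000 rdb nodes would pick BY_ID under this threshold, while 10,000 mem nodes against 5 rdb nodes would pick LEXICAL.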

05/05/08 14:30:11 changed by jpbetz

Following up on optimization case #1, where the number of mem nodes is small and the number of RDB nodes is large: a simpler approach is possible.

Try to look up an ID for the mem node; if one is found, use it.

If no ID exists for the mem node, create a 'temp' ID for it that is outside the number range used for RDB nodes. A simple mechanism might be (memNode.hashCode % MAGIC_NUMBER + MAX_NODE_ID). This would eliminate the need to insert a node centric entry for the mem nodes.
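A minimal sketch of the temp ID idea, with several assumptions: TempIdAssigner is a hypothetical class, the existing IDs are modelled as a plain map keyed by lexical value, and the concrete values of MAX_NODE_ID and MAGIC_NUMBER are placeholders. Two details matter in Java: the % operator can return a negative result for a negative hashCode, so the sketch uses Math.floorMod, and different mem nodes can hash to the same temp ID, so a match on temp IDs would still need to be confirmed on the lexical value.

```java
import java.util.Map;

final class TempIdAssigner {
    // Placeholder values; the real upper bound of the RDB ID range is an assumption.
    static final long MAX_NODE_ID = 1L << 40;
    static final int MAGIC_NUMBER = 1_000_003;

    // Stand-in for the lookup of existing node IDs, keyed by lexical value.
    private final Map<String, Long> knownIds;

    TempIdAssigner(Map<String, Long> knownIds) {
        this.knownIds = knownIds;
    }

    // Returns the existing ID when there is one, otherwise a temp ID that is
    // strictly above MAX_NODE_ID and therefore never clashes with a real ID.
    long idFor(String memNodeLexical) {
        Long existing = knownIds.get(memNodeLexical);
        if (existing != null) {
            return existing;
        }
        // floorMod keeps the offset in [0, MAGIC_NUMBER) even for negative hash codes.
        return MAX_NODE_ID + 1 + Math.floorMod(memNodeLexical.hashCode(), MAGIC_NUMBER);
    }

    // True when the ID came from the temp range rather than the database.
    static boolean isTemp(long id) {
        return id > MAX_NODE_ID;
    }
}
```

A mem node with a temp ID can never match an RDB node on ID, which is the desired outcome: if it had a counterpart in the database, the lookup would have returned that ID instead.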

05/06/08 19:00:37 changed by jpbetz

(In [2239]) re #322

Disable ID equals, hashCode and compareTo temporarily.

05/23/08 00:08:37 changed by jpbetz

(In [2361]) re #322

Add basic test harness for SPARQLAlgebra. Move some things around so harness can be written. Drop a couple obsolete cli files.

05/26/08 02:12:12 changed by jpbetz

(In [2363]) re #322

Add custom comparator support to SPARQLAlgebra. Fix anzo cli ask query output.

10/20/08 12:46:03 changed by mroy

  • status changed from new to closed.
  • resolution set to fixed.
Copyright © 2007 - 2008 OpenAnzo.org