anonymous user

Forums   Register   Login   Forgot your login/password?   Search

Index design - multiple text labels and multiple locations

Common forum | 1 | 2 | 3 | 4 | 5 | ... | 497 | 498 | 499 | 500 | next »» | Create new thread

orls

Name: Owen Smith
Posts: 4

2011-11-21 13:49:52 | reply!


Hi All, I'm wrestling with a sphinx index design problem.

My objects all have an arbitrary-length list of text labels, which I need integer
metadata associated with for weighting and filtering.

But these objects also have an arbitrary number of geographic locations associated with
them, as lat/lng pairs.

One of these objects in JSON might look like this:

SomeObject {
    id: 12345,
    labels: [
        {label:"Label 1", age:10},
        {label:"Another Label", age:100},
        {label:"Yet Another Label", age:350}
    ],
    locations: [
        {lat:0.123, lng:0.123},
        {lat:4.567, lng:4.567}
    ]
}

Location1, (lat, lng)
Location1, (lat, lng)

How can I construct a sphinx index that allows me to search by both name and location,
preserving the metadata associated with each text label in a way that I can filter on /
include in sort_expr?

My current design, which deals only with the labels so far, works well. It has a large
index for all the labels, each with a unique ID, and the 'real' object ID as an
attribute, which I group on:

(Note that all these examples are dramatically simplified, there's a lot more going on
in, for example, the customweight calc)

        # Source in conf:
        sql_query = SELECT id, real_object_id, age, label FROM all_object_labels
        sql_attr_uint = real_object_id
        sql_attr_uint = age

        //Query-time PHP API calls:
        $sphinx->SetSelect("*, (@weight / @age) as customweight");
        $sphinx->SetGroupBy("real_object_id", SPH_GROUPBY_ATTR, "customweight desc");
        $sphinx->Query("@label " . $users_search_terms, "LabelsIndex")

This gives me the best labels match, weighted down by age, with one hit for every 'real'
object.

I can imagine a similar solution for the locations: construct an index of just the
locations, with no free text fields, using a fake id to index every location, calling
SetGeoAnchor, grouping on the real_object_id. I've built indexes like that before that
work well when searching for just the location.

But I can't see how I can construct these indexes in a way that allows me to do a
free-text search on the labels *and* set a geoanchor/search by radius.

Ideally what I would want is sphinx to support multiple lat/lng pairs per-document, and
multiple values for a free-text field, but that's asking a bit too much :)

I could in theory construct a massively denormalized index, that had one document for
every combination of label and location:

1, real_id, label1, location1, ...
2, real_id, label2, location1, ...
3, real_id, label3, location1, ...
4, real_id, label1, location2, ...
5, real_id, label2, location2, ...
6, real_id, label3, location2, ...

But my index is going to get really big really fast and the SQL queries would get very
hairy. An object with 5 labels and 4 locations would yield 20 index entries.

Anyone got any bright ideas about how to query multiple text values and multiple
locations -- at the same time? If you can't tell, I'm relatively new to Sphinx index
design.

I'd like to avoid having to do too many sphinx queries, or manual checking of duplicates
on the PHP side if possible, but I'm willing to entertain anything at this point :)

barryhunter

Name: Barry Hunter
Posts: 6879

to: orls, 2011-11-21 13:58:32 | reply!


What geo-queries do want to run? Is it just "filter results to be all within N meters of
X,Y" - then sorted by your weight

Or should the 'distance' somehow factor into customweight ? (eg closer matches rank
higher?)



... in general I think the best way would be to move away from using sphinxes built in
geo-searching*, and instead build some sort of geospatial index - probably using fake
keywords. Can make a more specific suggestion with answer to above.






* Which might not be that quick if you have large number of documents anyway.

orls

Name: Owen Smith
Posts: 4

to: barryhunter, 2011-11-21 14:03:09 | reply!


> What geo-queries do want to run? Is it just "filter results to be all within N meters of
> X,Y" - then sorted by your weight
>
> Or should the 'distance' somehow factor into customweight ? (eg closer matches rank
> higher?)

I'd like to be able to factor in the distance to the final ordering, yes.

> ... in general I think the best way would be to move away from using sphinxes built in
> geo-searching*, and instead build some sort of geospatial index - probably using fake
> keywords. Can make a more specific suggestion with answer to above.

Hm, interesting -- any thoughts you have on that would be greatly appreciated.

> * Which might not be that quick if you have large number of documents anyway.

Oh, really? Do you know of any benchmarks on this? I do have a lot of object (~10million)
and we're currently only doing geo searches on one of the locations each (hence my
question), sphinx seems to be searching fast enough so far.

barryhunter

Name: Barry Hunter
Posts: 6879

to: orls, 2011-11-21 14:09:53 | reply!


One last question, is the 'distance' filter need to be absolute perfect?

Ie does have to be withing exactly 5km. Or would somewhere between 4 and 6 be ok?


> > * Which might not be that quick if you have large number of documents anyway.
>
> Oh, really? Do you know of any benchmarks on this? I do have a lot of object (~10million)
> and we're currently only doing geo searches on one of the locations each (hence my
> question), sphinx seems to be searching fast enough so far.

I do! My site uses sphinx to index geolocated photos. See rudimentry numbers posted here
http://sphinxsearch.com/forum/view.html?id=8082

Although I notice I dont have a pure GeoAnchor radios test. Will see if can find the test
code to try that. Expect it to be much worse that bbox anyway.

orls

Name: Owen Smith
Posts: 4

to: barryhunter, 2011-11-21 14:15:30 | reply!


> One last question, is the 'distance' filter need to be absolute perfect?
>
> Ie does have to be withing exactly 5km. Or would somewhere between 4 and 6 be ok?

Exact distance is unimportant (there's already a lot of vaguery in the radii from to
geocoding of the user input), but the relative ordering of results is important.

Thanks for the benchmark notes/link.

barryhunter

Name: Barry Hunter
Posts: 6879

to: orls, 2011-11-21 14:35:14 | reply!


Ok, yes I do think the tiles design could work.

Create fake keywords for you locations. The idea is you can then search using these
keywords to find results here. This is a somewhat complex idea to explain, but will
try... (also try searching this forum for 'tiles' will reveal previous threads)

eg given
    {lat:0.123, lng:0.123},

Can make fake keywords

0123x0123 012x012 01x01

(of course need to do more to allow for North/South etc, but lets keep this example
simple :)

Then to search for very near results can add

@loc 0123x0123

to your Full-text query.

Or even

@loc (0123x0123 | 0124x0123 | 0123x0124 | 0124x0124 ...

to include the 'surrounding' tiles.

Can then choose the approriate tile size depending on the search query radius.



... the refinement to this to allow 'nearby' results to show higher could be demonstarted
by the following query

@loc (0123x0123 | 012x012 | 01x01)

So results that very near, match all three terms = rank high*. But slightly futher one,
will only match the last two. Whereas far results will match all three terms.

Of course each 'scale' of tile, have multipler terms - as above to capture surounding.


  Keep with your "label_id" as the document_id. But attach the multiple location tiles to
  each row. Sphinx doesnt know or care that there are multiple different tiles on each
  record.



... Indexing this is generally quite easy generating the tiles, could be done with FORMAT
and CONCAT in mysql.

But querying does become more complex, as you have to generate all the right tiles, but
they are easily computed, once you decided on the tile size to use.



(I was asking now 'accurate' you want the filtering, because these tiles are obvoisoully
working on tile boundaries, so its somewhat approximate (although you can just use
smaller tile sizes - but then the queries become really long). And ordering by distance
is more granular too (depending on your tilesize))



* Although the only inbuilt ranking mode where this works well is wordcount. Perhaps use
the new expression ranker to make a combination - combining hit_count and lcs/bm25

orls

Name: Owen Smith
Posts: 4

to: barryhunter, 2011-11-21 17:31:53 | reply!


Ok, I see. What you're describing is pretty similar to geohashing ( http://geohash.org/
) -- you might be able to improve the speed and tidiness of this kind of search using
geohashes. Because they always share prefixes, you can do broad region searches by just
shortening the hash and putting a wildcard after it.

I'm concerned that this would be too fiddly and introduce too many odd cases, e.g. around
boundaries -- thanks for the help, though.

barryhunter

Name: Barry Hunter
Posts: 6879

to: orls, 2011-11-21 17:47:29 | reply!


> Ok, I see. What you're describing is pretty similar to geohashing ( http://geohash.org/ )

Yes. Geohashes could be used to make the tiles. The benefit of simple lat/long, is easier
indexing (dont need to compute and store the geohash in the index).
And avoids making the potentially big list of geohashes to search during query.

On the other hand they can be quickly indexed using sphinxes min_prefix_len support :)

The other issue geohashes, is the assumetic 'rectangles' that each hash preresents. Lop
off one charactor and you totally change the shape (not just the size) if the area. Makes
it harder (but not impossible!) to calculate neghbouring tiles.


>
> I'm concerned that this would be too fiddly and introduce too many odd cases, e.g. around
> boundaries -- thanks for the help, though.

Well yes, the idea is to make it 'fuzzy' for a serious increase in speed.

------

Like the old adage, "pick two: quickness, simplicity, accuracy"

Easy to make simple and accurate. Just loop though all results, check each distance. Wont
be able to do it all in sphinx, hence slow.

Can make it quick and accurate, but will be insanely complicated. (probably using
quad-trees?)

Or can make it quick and (relatively) simple - using the above tiles method. Sacrifice
accuracy.

Common forum | 1 | 2 | 3 | 4 | 5 | ... | 497 | 498 | 499 | 500 | next »» | Create new thread