We’re delighted to announce that Sphinx 2.1 begins support of JSON attributes. While complete support is yet to come (some quirks and limitations are yet to be ironed out), we consider this to be a major step ahead. Storing sparse key-value data is no longer a fundamental issue in Sphinx!
Motivation
Sphinx, like classic SQL databases, works with a so-called fixed schema, that is, a set of attribute columns. These work well when most of the data you store actually has values. However, mapping sparse data to static columns can be very cumbersome.
Assume for example that you’re running a price comparison or an auction site with many different products categories. Some of the attributes like the price or the vendor are identical across all goods. But from there, for laptops, you also need to store the weight, screen size, HDD type, RAM size, etc. And, say, for shovels, you probably wanna store the color, the handle length, and so on. So it’s manageable across a single category, but all the distinct fields that you need for all the goods across all the categories are legion.
The JSON field can be used to overcome this. Inside the JSON attribute you don’t need a fixed structure. You can have various keys which may or may not be present in all documents. When you try to filter on one of these keys, Sphinx will ignore documents that don’t have the key in the JSON attribute and will work only with those documents that have it.
How to index it
JSON attributes are available for both SQL and XML sources, and both in disk and RT indexes. Here’s an SQL source example:
source mysource
{
sql_query = SELECT id, content, metadata from mytable
sql_attr_json = metadata
} |
For this attribute, Sphinx expects a text field that contains JSON formatted data.
As we speak, the attribute can accept only one level JSON objects. This will change, and pretty shortly, but right now, multi-level JSON objects will not be indexed. Integer, float, and string array values are accepted.
An example of an accepted JSON object:
{ "id": 1, "gid": 2, "title": "some title", "tags": [ "tag1", "tag2", "tag3" ] } |
And here’s something that isn’t accepted right now (but will be in the future):
{ "id": 1, "gid": 2, "title": "some title", "tags": { "name": "tag", "value": "tagvalue" } } |
Error handling
Along with the new attribute, Sphinx’s indexer adds 2 new options:
json_autoconv_numbers
indexer
{
...
json_autoconv_numbers = 1
...
} |
When json_autoconv_numbers is 1, the parser will automatically detect and convert possible string values that represent a number into a numeric attribute. When autoconv is set to 0, values like “1234″ (mind the quotes!) will be stored as a string rather than a number, and then ignored in scalar contexts (such as, say, WHERE price<100 conditions). The default behavior is to conform to JSON strictly and not perform automatic conversion. However autoconv comes in pretty handy in situations when your data source and/or application is not too picky about the types.
For a complete example, consider that we have:
{ ... "gid":"123" ... } |
If autoconv is 1 , the gid attribute will become an integer number and you can filter it like obj.gid=123. Otherwise, it will stay as a string and only be matched by a string comparison like obj.gid=”123″.
on_json_attr_error
indexer
{
...
on_json_attr_error = ignore_attr
...
} |
By default, if a document has a JSON attribute with bad formatting, the indexing will not fail. The indexer will only show a warning and the failed field will be recorded as an empty JSON object. Setting on_json_attr_error = fail_index will make indexing fail at the first JSON parsing error.
Using the JSON object in searching
Let’s consider a simple index, like
source json_test
{
...
sql_query = select * from test
sql_attr_json = j
} |
Two types of filtering can be done at this moment: scalar filtering and string comparison. You can access keys in JSON objects by using obj.key. All documents may not have the filtered key in the JSON object and so only those having the key are considered. The key value is then compared with the filtering value.
mysql> select * from json_test where j.uid>100; +------+------+-------------------------------------+ | id | gid | j | +------+------+-------------------------------------+ | 1 | 1 | {"name":"Alice","uid":123} | | 2 | 1 | {"name":"Bob","uid":234,"gid":12} | | 3 | 2 | {"name":"Charlie","uid":345} | | 4 | 2 | {"name":"Damon","uid":456,"gid":23} | +------+------+-------------------------------------+ 4 rows in set (0.00 sec) mysql> select * from json_test where j.uid=123; +------+------+----------------------------+ | id | gid | j | +------+------+----------------------------+ | 1 | 1 | {"name":"Alice","uid":123} | +------+------+----------------------------+ 1 row in set (0.00 sec) |
String comparison:
mysql> select * from json_test where j.name='Alice'; +------+------+----------------------------+ | id | gid | j | +------+------+----------------------------+ | 1 | 1 | {"name":"Alice","uid":123} | +------+------+----------------------------+ 1 row in set (0.00 sec) |
Sorting
Sorting JSON objects is similar to sorting anything else, you can have multiple ordering rules:
mysql> select id, j from json_test order by j.name asc, j.uid desc; +------+----------------------------------------+ | id | j | +------+----------------------------------------+ | 9 | {} | | 15 | {} | | 5 | {"12":345,"34":567} | | 10 | {} | | 6 | {"12":346,"179":971} | | 8 | {"sv":["one","two","three"],"gid":315} | | 20 | {"name":"Alice","uid":450} | | 1 | {"name":"Alice","uid":123} | | 2 | {"name":"Bob","uid":234,"gid":12} | | 3 | {"name":"Charlie","uid":345} | | 4 | {"name":"Damon","uid":456,"gid":23} | +------+----------------------------------------+ 11 rows in set (0.00 sec) |
When sorting by ascending values, documents which don’t have the sorting key will come first. Documents that have a sorting key in the JSON attribute will then be sorted by the key value. If a second sorting rule is present, like in the example above, documents with identical values on the first key will be sorted by the second key.
When sorting by descending values, the documents that don’t have the sorting key present in the JSON attribute will appear last. These documents don’t have any default sorting rule yet, they are sorted in the order they are picked up from the index.
mysql> select id, j from json_test order by j.name desc, j.uid asc; +------+----------------------------------------+ | id | j | +------+----------------------------------------+ | 4 | {"name":"Damon","uid":456,"gid":23} | | 3 | {"name":"Charlie","uid":345} | | 2 | {"name":"Bob","uid":234,"gid":12} | | 1 | {"name":"Alice","uid":123} | | 20 | {"name":"Alice","uid":450} | | 8 | {"sv":["one","two","three"],"gid":315} | | 6 | {"12":346,"179":971} | | 15 | {} | | 10 | {} | | 9 | {} | | 5 | {"12":345,"34":567} | +------+----------------------------------------+ 11 rows in set (0.00 sec) |
Last but not least, grouping also works. So all the standard clauses, i.e., WHERE, ORDER BY, and GROUP BY do work with JSON attributes.
Fetching individual sub-columns
This is rather self-explanatory:
mysql> select id, j.price from json; +------+---------+ | id | j.price | +------+---------+ | 1 | 100 | | 2 | 200 | | 3 | NULL | +------+---------+ 3 rows in set (0.00 sec) |
Conclusion
Sphinx 2.1 adds partial but already functional and pretty useful JSON support. (We’re actually aware of a few customers already rolling this out in production.) You can now store sparse key-value pairs into your Sphinx documents, and use WHERE, ORDER BY, and GROUP BY on top of that data. Sub-object support and advanced querying syntax will definitely be coming up in the next releases, too.
Watch for next week’s blog! We’ve got more to say about what you can expect from Sphinx 2.1!
| « January 29, 2013. A new tool in the trunk: wordbreaker | February 12, 2013. Comparing Compilers » |


very nice improvement=)
Say please how sphinx uses json attribute when execute query.
Does sphinx do full scan or it is stored like real index (b-index e.g.) and sphinx does quick select with O(log(n)) cost?
@boa, they are first-class attributes, and we don’t have any attribute indexes yet. So every full-text or full-scan match gets examined.
However “is this key in that JSON” checks are accelerated using a tiny Bloom filter.
I think this will become a game changing feature for sphinx, congrats! Can’t wait for the json sub-object feature.
Just wondering if aggregate functions AVG(), MAX(), MIN(), SUM() will be supported in json attributes? Don’t appear to be functional yet, but would be great if it could!
@Mark: I would say yes, but shodan can confirm. The json attribute it’s still in the early days, but lot of things will come in future, for example several days ago in trunk partial support for the json attribute was added in RT indexes.
Why it doesn’t work “where j.attr1=7 and j.attr2=9″?
+——+————+————+———————–+
| id | name | product_id | j |
+——+————+————+———————–+
| 7 | катя | 30 | {“attr1″:7,”attr2″:9} |
Version: Sphinx 2.1.1-beta (rel21-r3701)
Aggregate functions (with explicit type conversion) and multiple conditions have been added in r3742 and r3751.
In case like
{
“tags”: [
"tag1",
"tag2",
"tag3"
]
}
where the value is set, do sql word accordingly? Fore example, find_in_set?