Issues with Indexing while using Cassandra.
We have a single machine cassandra setup on which we are trying different things for analytics. One of Column family we have goes with this discription:
CREATE TABLE playground.event_user_table (
event_date date,
event_time timestamp,
author text,
content_id text,
content_model text,
event_id text,
event_type text,
PRIMARY KEY (event_date, event_time)
) WITH CLUSTERING ORDER BY (event_time ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
CREATE INDEX author ON playground.event_user_table (author);
With this table, we populated data in it for different apps/models. Now when we query system with something like:
cqlsh:playground> select * from event_user_table where event_date = '2013-06-02' ;
event_date | event_time | author | content_id | content_model | event_id | event_type
------------+--------------------------+----------------------------------------+-------------------------------------------------+---------------+----------+------------
2013-06-02 | 2013-06-02 00:00:00+0000 | anuragabes@example.com | anuragabes@example.com##2##2012 | A | | submitted
2013-06-02 | 2013-06-02 01:28:13+0000 | dear_kirnesh@example.com | 1000910424 | B | | closed
2013-06-02 | 2013-06-02 01:59:31+0000 | dsammaiah@example.com | dsammaiah@example.com##1##2013 | A | | created
2013-06-02 | 2013-06-02 02:00:44+0000 | jacobg@example.com | jacobg@example.com##5##2013 | A | | created
2013-06-02 | 2013-06-02 02:02:16+0000 | kriti.jneja@example.com | kriti.jneja@example.com##4##2013 | A | | created
Result looks good and as expected. Now I query system on the secondary index of author and I get empty or partial results:
cqlsh:playground> select * from event_user_table where author = 'anuragabes@example.com' ;
event_date | event_time | author | content_id | content_model | event_id | event_type
------------+------------+--------+------------+---------------+----------+------------
(0 rows)
cqlsh:playground> select * from event_user_table where author = 'senthil.ramachandran@example.com' ;
event_date | event_time | author | content_id | content_model | event_id | event_type
------------+--------------------------+------------------------------------+------------+---------------+----------+------------
2014-01-18 | 2014-01-18 09:01:52+0000 | senthil.ramachandran@example.com | 1001068325 | SRF | | closed
(1 rows)
And I have tried this combinations of PRIMARY KEY too ((event_date, event_time), author)
but with same results. There are known issues with secondary indexes
and scaling1 but it affects single node systems too? I am not
sure about it. Time to confirm things.
Update1 2, Cassandra has "'lazy' updating to secondary indexes. When you change an indexed value, you need to remove the old value from the index." Could that be the reason?
: As mentioned here