Skip to main content

Issues with Indexing while using Cassandra.

We have a single machine cassandra setup on which we are trying different things for analytics. One of Column family we have goes with this discription:

CREATE TABLE playground.event_user_table (
    event_date date,
    event_time timestamp,
    author text,
    content_id text,
    content_model text,
    event_id text,
    event_type text,
    PRIMARY KEY (event_date, event_time)
) WITH CLUSTERING ORDER BY (event_time ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
CREATE INDEX author ON playground.event_user_table (author);

With this table, we populated data in it for different apps/models. Now when we query system with something like:

cqlsh:playground> select * from event_user_table where event_date = '2013-06-02' ;
 event_date | event_time               | author                                 | content_id                                      | content_model | event_id | event_type
------------+--------------------------+----------------------------------------+-------------------------------------------------+---------------+----------+------------
 2013-06-02 | 2013-06-02 00:00:00+0000 |                 anuragabes@example.com |                 anuragabes@example.com##2##2012 |           A   |          |  submitted
 2013-06-02 | 2013-06-02 01:28:13+0000 |               dear_kirnesh@example.com |                                      1000910424 |           B   |          |     closed
 2013-06-02 | 2013-06-02 01:59:31+0000 |                  dsammaiah@example.com |                  dsammaiah@example.com##1##2013 |           A   |          |    created
 2013-06-02 | 2013-06-02 02:00:44+0000 |                     jacobg@example.com |                     jacobg@example.com##5##2013 |           A   |          |    created
 2013-06-02 | 2013-06-02 02:02:16+0000 |                kriti.jneja@example.com |                kriti.jneja@example.com##4##2013 |           A   |          |    created

Result looks good and as expected. Now I query system on the secondary index of author and I get empty or partial results:

cqlsh:playground> select * from event_user_table where author = 'anuragabes@example.com' ;
 event_date | event_time | author | content_id | content_model | event_id | event_type
------------+------------+--------+------------+---------------+----------+------------

(0 rows)

cqlsh:playground> select * from event_user_table where author = 'senthil.ramachandran@example.com' ;
 event_date | event_time               | author                             | content_id | content_model | event_id | event_type
------------+--------------------------+------------------------------------+------------+---------------+----------+------------
 2014-01-18 | 2014-01-18 09:01:52+0000 |   senthil.ramachandran@example.com | 1001068325 |           SRF |          |     closed

(1 rows)

And I have tried this combinations of PRIMARY KEY too ((event_date, event_time), author) but with same results. There are known issues with secondary indexes and scaling1 but it affects single node systems too? I am not sure about it. Time to confirm things.

Update1 <2016-02-10 Wed 15:45>: As mentioned here2, Cassandra has "'lazy' updating to secondary indexes. When you change an indexed value, you need to remove the old value from the index." Could that be the reason?