I came across this article on schema and data modelling in Cassandra versus a traditional SQL counterpart. In my current job, we have an input stream that consists of several files throughout the day, each containing the same structured data. Currently an ETL tool ingests these files and writes them out to different SQL Server tables. Further ETL jobs then take this data and denormalize it.
We have two issues at the database level. The first is performance, but I am not convinced that is due to this being a relational database. I think in most cases we could use SQL features and techniques like indexing and denormalization to alleviate most of these problems. The second concern is size. Because the tables are so large, the indexes become huge over time. If a query is not optimized, one bad read on such a table can kill you. A time-consuming job can lock up the table, and ingestion essentially has to stop until it finishes.
I have been wondering whether a NoSQL option like Cassandra would help in this case. Cassandra gives you horizontal scaling and, when tables are modelled around your queries, fast reads. It also encourages (careful) data duplication and denormalization. The whole one-table-per-query-type way of thinking may help.
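To make the one-table-per-query idea a bit more concrete, here is a minimal sketch in plain Python rather than in Cassandra itself. The record fields (`source`, `day`, `value`) and the two queries are assumptions made up for illustration, not our actual data. Each dictionary stands in for a table keyed by exactly what one query filters on, and every record is deliberately written to both.

```python
from collections import defaultdict

# Hypothetical records, standing in for rows parsed from the daily input files.
records = [
    {"source": "file_a", "day": "2024-01-01", "value": 3},
    {"source": "file_b", "day": "2024-01-01", "value": 5},
    {"source": "file_a", "day": "2024-01-02", "value": 7},
]

# One "table" per query type, each keyed by what that query filters on.
readings_by_source = defaultdict(list)  # query 1: everything from one source
readings_by_day = defaultdict(list)     # query 2: everything for one day


def ingest(record):
    """Write the same record to every query table (deliberate duplication)."""
    readings_by_source[record["source"]].append(record)
    readings_by_day[record["day"]].append(record)


for record in records:
    ingest(record)


def values_for_source(source):
    """Answer query 1 with a single key lookup, no scan and no join."""
    return [r["value"] for r in readings_by_source.get(source, [])]


def values_for_day(day):
    """Answer query 2 the same way, from its own table."""
    return [r["value"] for r in readings_by_day.get(day, [])]
```

The trade-off is visible even at this size: reads become single lookups, but every write happens twice, and keeping the copies in sync becomes the application's job.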
That’s all. No solutions for today. Just thoughts and questions.
I came across this great article on Microsoft Azure Docs on NoSQL vs. SQL. In the development world, new technologies come down like rain every day. It is easy to get caught up in the latest and biggest trend and to swap your current favorite technology (a hammer) for a different one (a different hammer), whatever the problem (a nail). It is important not to lose sight of what a new technology is actually good for, and when to use it or not.
The Microsoft article gives a great example of a social site where a user makes a post with different media that gets comments and likes from other users. In a purely relational design, you may end up creating separate tables for users, posts, media types, comments, etc., with one-to-many or many-to-many relationships going every which way. Something as simple as showing a user's post may then require joins across several of these tables. Definitely not great for performance.
In comparison, in a document-based NoSQL database, you could save an entire document with all the relevant information for a particular post, assigned to a user. Reading it back would be very performant, unlike the multi-table, multi-relationship, joins-all-over solution an RDBMS would offer.
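Here is a small sketch of that contrast, using Python's built-in `sqlite3` for the relational side and a plain dictionary of JSON strings as a stand-in for a document store. The table names, columns, and sample data are my own assumptions for illustration and are not taken from the Microsoft article.

```python
import json
import sqlite3

# Relational layout: users, posts and comments live in separate tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY, post_id INTEGER, user_id INTEGER, text TEXT
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO posts VALUES (10, 1, 'Hello world');
    INSERT INTO comments VALUES (100, 10, 2, 'Nice post');
""")


def load_post_relational(post_id):
    """Rebuild one post by joining posts, users (twice) and comments."""
    rows = conn.execute(
        """
        SELECT p.body, author.name, c.text, commenter.name
        FROM posts p
        JOIN users author ON author.id = p.user_id
        LEFT JOIN comments c ON c.post_id = p.id
        LEFT JOIN users commenter ON commenter.id = c.user_id
        WHERE p.id = ?
        ORDER BY c.id
        """,
        (post_id,),
    ).fetchall()
    if not rows:
        return None
    body, author = rows[0][0], rows[0][1]
    # Rows with a NULL comment text come from posts that have no comments.
    comments = [{"user": r[3], "text": r[2]} for r in rows if r[2] is not None]
    return {"id": post_id, "author": author, "body": body, "comments": comments}


# Document layout: the whole post is stored as one JSON document, keyed by id.
document_store = {
    10: json.dumps({
        "id": 10,
        "author": "alice",
        "body": "Hello world",
        "comments": [{"user": "bob", "text": "Nice post"}],
    })
}


def load_post_document(post_id):
    """Fetch the same post with a single key lookup."""
    raw = document_store.get(post_id)
    return json.loads(raw) if raw is not None else None
```

Both functions return the same structure for post 10. The difference is in how much work it takes to assemble it, and in the fact that the document version stores the author and commenter names as copies that have to be updated if a user is renamed.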
There are things that relational databases are good at, for instance:
- Relational Queries
- Defined and uniform table structure (all entries have same fields)
- Well Defined Schema (though adding properties requires more work)
- Structured Data
- Vertical Scaling (More RAM, More Processing Power)
and there are things NoSQL storage is good at, for instance:
- Non-relational data (JSON, key-value pairs, etc.)
- Ease of adding new properties
- Unstructured data
- Availability over Consistency (CAP Theorem)
- Horizontal Scaling (Add Servers)
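
One item from each list, the well-defined schema versus the ease of adding new properties, can be sketched in a few lines. This again uses `sqlite3` and JSON strings as stand-ins. The `nickname` field and the sample users are hypothetical.

```python
import json
import sqlite3

# Relational side: the schema is fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")


def insert_with_nickname():
    conn.execute(
        "INSERT INTO users (id, name, nickname) VALUES (1, 'alice', 'al')"
    )


try:
    insert_with_nickname()
    relational_needed_migration = False
except sqlite3.OperationalError:
    # The table has no such column yet, so the schema must be migrated first.
    conn.execute("ALTER TABLE users ADD COLUMN nickname TEXT")
    insert_with_nickname()
    relational_needed_migration = True

# Document side: each document carries its own fields, so a new property
# is simply present on the documents that have it.
documents = [
    json.dumps({"id": 1, "name": "alice"}),
    json.dumps({"id": 2, "name": "bob", "nickname": "bobby"}),
]

# Readers have to cope with the field being absent on older documents.
nicknames = [json.loads(doc).get("nickname") for doc in documents]
```

The flexibility is not free: the relational table guarantees every row has the same columns, while with documents the job of handling missing or inconsistent fields moves into the code that reads them.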