My 2nd-week session was talking about how I data choose my data storage. My mission in this coaching is building data pipelines with e-commerce data set and only use the open-source tool. Before doing the project, I honestly have no idea where/how I can start, so my instructor recommend me there was one student who used the e-commerce data set I choose, and she built her data pipelines on AWS. My main purpose is to apply data engineering skillset in my industry in particular our new test framework. Therefore, I should use the open-source tools in our on-premise server in my company since my company does not allow to use of any tools from main cloud providers(AWS, Azure, GCP). It truly fits well in my circumstance. The picture below represents the streaming pipelines architecture I drew referring to Richie’s GitHub (https://github.com/Richie-Kwon/ecommercedata).
She used AWS product on her data pipeline, and I draw the component architecture using the open-source tools.
Now, the instructor wants me to think about which database I should choose for my project because I chose Cassandra and PostgreSQL corresponding to Redshift and DynamoDB respectively. Before the session, I did not understand what’s the data warehouse and when we use it, and I did not understand correctly what the NoSQL is. I misunderstood NoSQL is just literally the opposite meaning of SQL.
Here is my ToDo list to complete by the beginning of the next session.
- Data Warehouse vs other Databases understand OLTP vs OLAP
- Find alternatives to Cassandra, other NoSQL stores (what are the differences)
- Find a Data warehousing solution we could use
reference:
DB-engines; Details about DBs: https://db-engines.com/en/
Choosing Datastores; The lecture series on Leardataengineering platform: https://learndataengineering.com/courses
OLTP vs OLAP, what’s Data Warehouse?
Operational Databases(OLTP): This is used for the main business case. Data should be very structured. This can use transactions. (A transaction is a unit of changing the drive state, put it simply, available to use SQL modified row values on the data table such as Mostly INSERT, DELETE, UPDATE). Users can use data by small transactions but often, for example, purchase history in the platform. Other business systems — checking /searching something in the dataset.
How to handle a larger number of small transactions?
by CS329s
Analytical Databases(OLAP): More flexible — see the entire data set, larger data set for business intelligence, getting insight. Less structured than OTLP case. When data gets in OLTP use transaction but OLAP gets data in by copying. OLAP is used for more analytical like finding out the highest growth customer in month/year or finding out the best customer in the year. Data analysts, machine learning practitioners can access this. They can use a lot of types of databases or data warehouses like redshift or snowflake. These things are very well distributed and have a big processing power. These are working on a lot of data. Operations are mostly SELECT
How to get aggregated information from a large amount of data?
by CS329s
To find the meaning of the data warehouse we do to know the structured and unstructured data. The structured data is following by the schema, and if you’re committed to a data schema in storage, then it assumes data has already been processed. The repository of the structured data is called a data warehouse, but the repo with unstructured data(raw data)is called a data lake.
Find an alternative to Cassandra, other NoSQL stores (Differences?)
The type of NoSQL: Document, wide column, graph, key-value, and time-series database. Cassandra is categorized as a wide-column group; I was looking for the alternative AWS DynamoDB as an open-source tool. Dynamo is in the key-value group. So I should try to find out other tools in the key-value group.
TBD (other NoSQL stores (Differences?)) — I will read “Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement 2nd Edition”
Find data warehousing we could use?
Reference: database ranking website: https://db-engines.com/en/ ; data-engineer roadmap: https://github.com/datastacktv/data-engineer-roadmap
I should choose the tool as a data warehouse among Snowflake, Presto, Apache Hive, Apache Impala, ClickHouse.