Hacker News
The Rise and Fall of the OLAP Cube (2020) (holistics.io)
211 points by smartmic on July 5, 2021 | 91 comments



"We model data according to rigorous frameworks like Kimball or Inmon because we must regularly construct OLAP cubes for our analyses."

Actually, no. We model our data this way so it can be used for business decisions. It doesn't take long for any entity of any scale to discover that the logging and eventing done for heartbeat status, debugging, and scaling is just different than what you need to make a variety of business decisions.

You can solve this with more events at different scales (button clicks nested in screen views), or pick events or event rollups that appear to be clean business stages ("completed checkout"), but still, your finance team and your marketing group all have different needs.

So, you decide to have some core shared metrics, derived and defined, and make them usable by everyone. Folks agree on the definitions, and thanks to ease and trust, you see more data supporting more decisions.

You discover that some folks are doing 10-table joins to get an answer; it's fast but difficult to extend for new questions. You decide to build a view that solves some of these pains, refactoring to allow a better time dimension. Your view links with the metrics you created, and the resulting queries shed tons of CTEs while becoming readable to the average user.

And now you have some ELT pipelines and some event transforms that produce counts and filters mapping nicely to your business needs while still letting you get at the atomic raw data, and you and your teams start to trust in consistent results. Your metrics are mostly cleanly summable, and the ones that aren't live in table views that precalc the "daily uniques" or other metrics that need a bit of special handling.
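To make that last bit concrete, here's a minimal sketch of the kind of precalc view I mean (table and column names are made up):

    CREATE VIEW daily_checkout_uniques AS
    SELECT
        CAST(event_ts AS DATE)  AS event_date,    -- the shared time dimension
        COUNT(DISTINCT user_id) AS daily_uniques  -- not summable, so precalc it here
    FROM events
    WHERE event_name = 'completed_checkout'
    GROUP BY 1;

Downstream queries can SUM the plain counts freely and take the precalc'd uniques as-is, instead of everyone re-deriving distinct counts with their own CTEs.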

You've started modeling your data.

No, we don't need olap cubes. But we do need some type of rigor around analytic data. Otherwise, why go to all the trouble to collect it, count it, and predict from it if it may be wrong with no measure of that uncertainty?

And yeah, Kimball et al. are from a world where OLAP was the answer, but it turns out they solved a broader set of SQL problems. So it's worth learning the good, tossing the dated, and seeing what good data modeling can do for your analysis and predictions.


> Actually, no. We model our data this way so it can be used for business decisions

Yeah, that would be my disagreement as well.

Sure, some of Kimball might be obsoleted by modern technology. But I don't think the point of Kimball is cubes, and even if it was, what better is there?

I'd be really interested in what else is out there that is more modern and better suited to today's world.

Data Vault? That probably isn't it, most of all because it's all about the "platform to build your Kimball on" rather than the actual business modelling. (But the parts about the Metrics Vault, the Quality Vault, and the "service stuff" are really good.)

Anchor Modelling? Looking at it with 2021 eyes, that one seems like a way to implement columnar storage and schema-on-read in MS SQL Server using stored procedures ... which is probably not actually a good idea.

The Puppini Bridge, aka the Unified Star Schema, seems like an interesting concept, especially in the self-service space. But even its proponents warn that it doesn't scale performance-wise, and it's also kind of an incremental change on Kimball. (But the way the bridge table is basically an adjacency matrix of the graph of your objects tickles my inner computer scientist's fancy.)

So really, what else is there?


This isn't widely known, but activity schema[1] is aiming for this space (disclosure - I co-founded the company supporting it).

It's a modeling approach that separates data modeling from data querying -- meaning that once data is modeled, it can answer any number of questions. The base building blocks of the data model can be combined at query time without having to figure out explicit foreign key joins.

[1] https://www.narrator.ai/activity-schema/
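Very roughly, the idea is that everything lands in one long, narrow time-series table, and questions are answered by relating activities in time rather than through foreign keys. A sketch of the shape (column names here are illustrative, not the actual spec -- see the link for that):

    CREATE TABLE activity_stream (
        activity_id    VARCHAR,       -- unique row id
        ts             TIMESTAMP,     -- when the activity happened
        customer       VARCHAR,       -- the entity the activity belongs to
        activity       VARCHAR,       -- e.g. 'completed_order', 'opened_ticket'
        feature_json   VARCHAR,       -- a handful of activity-specific attributes
        revenue_impact DECIMAL(18,2)  -- optional monetary value
    );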


Have you looked into the VKG (Virtual Knowledge Graph) approach?


BTW, their followup piece at https://www.holistics.io/blog/olap-is-not-olap-cube/ is a nicer drill down into some of these issues. Also a good read.


Could not have written this any better. I’ve been in the data business for over 10 years. A number of technologies have come and gone during that time, but dimensional modeling and solid SQL skills keep delivering remarkable value.


Indeed, it is generally good for analytics to have a conceptual data model that is reasonably efficient both on the technical side (to quickly get answers for queries), and also on the logical side (to quickly translate questions to queries).

This model doesn't need to be OLAP cubes, but it's also not that easy to find something better.


> No, we don't need olap cubes. But we do need some type of rigor around analytic data. Otherwise, why go to all the trouble to collect it, count it, and predict from it if it may be wrong with no measure of that uncertainty?

Modern data warehouses don't need to build cubes for performance reasons. Low latency data warehouses like ClickHouse or Druid can aggregate directly off source data. The biggest driver for modeling is allowing non-coders to access data and perform their own analyses. I don't see that problem ever going away. Cube modeling with dimensions and measures solves it well.


“Be suspicious of companies that are heavily locked into the OLAP cube workflow. (Learn how to do this here).“

The opposite is true too. Be suspicious of companies trying to lock you into their self-service data visualization tool. I have seen many BI tool vendors try to lock their clients into their tool. As a result, the clients end up running unnecessarily complex models.

I find OLAP a pretty good mental model that developers and users alike can understand.

The sheer amount of handwritten, tailor-made ETL or ELT pipelines is something I'd like to see automated or replaced with something better.


Yup. At my job we ended up with some way-too-important stuff "modelled in the BI tool" in a way that nobody could understand how it actually worked, and we had to do a full rewrite to model it in SQL.

But not all BI tools are alike - something like Superset which internally uses SQL and expects users to do self-service transformation using views is easy to industrialize since you already have the queries. Something like Tableau, not so much.

(If only Superset were more mature)


I'm a bit opinionated on data modelling, so a couple points.

1. "OLAP Cubes" arguably belong to Microsoft and refer to SQL Server cubes that require MDX queries. It's a solution given to us by Microsoft that comes with well understood features. "OLAP" itself is a general term used to describe any database used for analytical processing, so OLAP has a wide range of uses.

2. OLAP Cubes (as defined above) started to decrease in popularity around 2015 (I'd argue).

3. Any solution to fixing "OLAP" that comes from a commercial vendor is suspicious. As painful as Kimball & Inmon are, they are ideas that don't require vendor lock-in.

4. At my current company, we recently wrapped up a process where we set out to define the future of our DW. Our DW was encumbered by various design patterns and contractors that came and went over the past decade. We analyzed the various tables & ETLs to come up with a clear set of principles. The end result looks eerily close to Kimball but with our own naming conventions. Our new set of principles clearly solve our business needs. Point being you don't need Kimball, Inmon, Looker, etc to solve data modelling problems.

5. Columnar databases are so pervasive these days that you should no longer need to worry about a lack of options when choosing the right storage engine for your OLAP db.

6. More and more data marts are being defined by BI teams with little modeling/CS background. To that end, I think it's important to educate your BI teams to minimize the fallout, and to be accommodating in your DW so you can work efficiently with them. This is to say: settle on a set of data modelling principles that work for you, even if they may not work for someone else.


I'm struggling to figure out what you mean by number 4. Isn't the result of that exercise proof that the Kimball approach is a worthwhile one?


We evolved our guidelines for data modeling to something that closely resembles Kimball, but it's not Kimball.

Some of our ETLs were created by people that were not familiar with Kimball. We went about defining new data modeling guidelines, taking the good parts out of this work, eliminating the bad parts, and many of the good parts share a lot in common with Kimball.


To me this still sounds as if, had the original people known Kimball designs, the bad parts might have been avoided, since apparently the good stuff was a rediscovery of Kimball methods anyway?


I think the article is spot on about workflow but off on tech. Under the hood, Cubes will be back.

Snowflake is the cubeless column store.

And now Firebolt comes along, promising order-of-magnitude improvements. How? Largely "join indexes" and "aggregation indexes", which seem to smell like cubes.

So take the modern column store and slap some cubes in front.

In Firebolt this seems to be added by programmers. But I think the future will see OLAP MPP engines that transparently pick cubes, maintain them, reuse them, and discard them, all automatically.


> How? Largely “join indexes” and “aggregation indexes” which seem to smell like cubes.

Just curious: what stops Snowflake from adding these too? Trying to understand what innovation Firebolt is bringing.


Hopefully nothing.

I only understand these systems about as well as you can without actually fighting them in production, so ymmv ;)

There are things that Firebolt is doing with SSD caching that it's easy to see will massively help performance, but they also hinder one nice aspect of Snowflake, which is that one Snowflake region can read-only access the storage of another. I'm guessing that this works via direct S3 access, but the moment there is tiered storage, inter-cluster and inter-region access would have to go through the compute cluster instead, etc.


The OLAP cube needs to come back, as an optimization of a data-warehouse-centric workflow. If you are routinely running queries like:

    SELECT dimension, measure
    FROM table
    WHERE filter = ?
    GROUP BY 1
You can save a lot of compute time by creating a materialized view [1] of:

    SELECT dimension, filter, measure
    FROM table
    GROUP BY 1, 2
and the query optimizer will even automatically redirect your query for you! At this point, the main thing we need is for BI tools to take advantage of this capability in the background.

[1] https://docs.snowflake.com/en/user-guide/views-materialized....


But those queries aren’t equivalent so how is anything saved by materializing the second one?

e:

I believe (I could be wrong!) you edited the second query from

  SELECT dimension, measure
    FROM table
    GROUP BY 1
To

  SELECT dimension, filter, measure
    FROM table
    GROUP BY 1, 2
This addresses the filtering but how is that any different from the original table? Presumably `table` could have been a finer grain than the filter and dimension but you’d do better to add the rest of the dimensions as well, at which point you’re most of the way to a star schema.

This kind of pre-computed aggregate is typical in data warehousing. But is it really an “OLAP cube”?

In general I agree there is value in the methods of the past and we would be well served to adapt those concepts to our work today.


By grouping them, you need to wrap the measure in an aggregate function like avg, max, min, sum, or count.


It’s much smaller than the original table. If you compute lots of these, then voila, you have an OLAP cube.


Well the size savings is a function of the number of included dimensions and the original table grain.

I wouldn’t call this an “OLAP Cube”. It’s just an aggregated fact table. A collection of those with their corresponding dimensions is a “data mart”.


A data mart is a logical organization of data to help humans understand the schema. What I am describing is a physical optimization, extremely similar to what an OLAP cube would do, but implemented on top of a SQL data warehouse. It’s an orthogonal concept to a data mart.


I guess I still don’t know what an OLAP Cube is.


He did edit his comment, but unfortunately didn't acknowledge the edit.


Looker does exactly this, though you do have to specify which dimensions to aggregate: https://docs.looker.com/data-modeling/learning-lookml/aggreg...


While the native materialized view feature is a great start, materialized views unfortunately aren't that useful in practice if you care about data lineage. They work like a black box, and they can't guarantee query performance.

The new generation of ELT tools such as dbt partially solves this problem. You can model your data and incrementally update the tables used by your BI tools. Looker's Aggregate Awareness is also a great start, but unfortunately it only works in Looker.

We try to solve this problem with metriql as well: https://metriql.com/introduction/aggregates The idea is to define these measures & dimensions once and use them everywhere: your BI tools, data science tools, etc.

Disclaimer: I'm the tech lead of the project.


Is that related to lightdash.com somehow?

It seems like a very similar technology and also the webpages for both are almost identical.


We both use dbt as the data modeling layer but we don't actually develop a standalone BI tool. Instead, we integrate to third-party BI tools such as Data Studio, Tableau, Metabase, etc.

We love Looker and wanted to bring the LookML experience to existing BI tools rather than introducing a new BI tool; that's how metriql was born. I believe Lightdash is a cool project, especially for data analysts who are extensively using dbt, but metriql targets users who are already using a BI tool. I'm not sure which pages are identical, though - can you point me to them?


Compare https://metriql.com/introduction/creating-datasets and https://docs.lightdash.com/guides/how-to-create-metrics

I thought you were affiliated somehow, but looking at it now, it seems you just use the same documentation website generator :)


Ah, that makes sense! We use https://docusaurus.io/ primarily because dbt uses it as well. :)

It would be great if we could team up to build an open specification for metric definitions, though.


Did you mean to write HAVING in your first query? Otherwise your second query is not equivalent to the first, because the WHERE will not be performed prior to the aggregation.


The author mentions the Codd-OLAP conflict of interest, lamenting that despite what happened "Codd is still regarded as the father of the relational database". But this remark sounds like a non sequitur to me. No matter the scandal he got into in the 90s, he still invented relational algebra back in the 70s, didn't he?


the “scandal”

> coined the term Online analytical processing (OLAP) and wrote the "twelve laws of online analytical processing". Controversy erupted, however, after it was discovered that this paper had been sponsored by Arbor Software (subsequently Hyperion, now acquired by Oracle), a conflict of interest that had not been disclosed, and Computerworld withdrew the paper.

https://en.m.wikipedia.org/wiki/Edgar_F._Codd


He published a scientific paper that was to some degree paid advertising, and didn't disclose the payment.

That kind of thing still totally counts as a scandal.


Indeed, but was the paper wrong?


It doesn't matter. The undisclosed conflict of interest should eliminate it from consideration.


OLAP Cube is not dead, as long as you use some form of:

1. GROUP BY multiple fields (your dimensions), or

2. partition/shuffle/reduceByKey

and materializing/caching the result of this query for later reuse, you are rolling your own OLAP Cube with whatever data processing ecosystem you have on hand.

The technology itself has clear use cases and delivers incredible business value on a daily basis; only the implementation details differ depending on the surrounding ecosystem.
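For example, something as plain as this (names invented) is already a hand-rolled cube once you persist it and point queries at it:

    CREATE TABLE sales_rollup AS      -- materialize/cache for later reuse
    SELECT
        region,                       -- dimension
        product_category,             -- dimension
        order_date,                   -- dimension
        SUM(amount) AS total_amount,  -- measure
        COUNT(*)    AS order_count    -- measure
    FROM orders
    GROUP BY 1, 2, 3;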


You can pry the star schema database from my cold dead hands. You can get a lot of work done, without much effort, in multi-dimensional row stores with bare SQL. The ETLs are easier to write than the map-reduce jobs I've seen on columnar stores. ETL pipelines get the data into the structure I need before I begin analysis. ELT requires me to know too much about the original structure of the data. Sure, that's useful when you find a novel case, but shouldn't that be folded back into the data warehouse for everyone?


>Sure, that's useful when you find a novel case, but shouldn't that be folded back into the data warehouse for everyone?

And why exactly isn't this possible to do in the data warehouse? Most every company seems to use dbt, so even analysts can write transforms to generate tables and views in the warehouse from other tables. Hell, even Fivetran lets you run dbt or other transformations after loading data.


I absolutely love Looker for this reason. It understands foreign keys so if you model everything as stars it just writes SQL for you.

So simple, so powerful. I wish something open source would do this painfully obvious thing.
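Roughly the kind of SQL such a tool ends up generating from a star model (tables and keys here are hypothetical, not actual Looker output):

    SELECT
        d.calendar_month,
        c.customer_segment,
        SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date     d ON f.date_key     = d.date_key
    JOIN dim_customer c ON f.customer_key = c.customer_key
    GROUP BY 1, 2;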


Columnar stores speak SQL these days, though they may map-reduce under the hood. ELT vs. ETL just moves some transformation into SQL; the data normally still ends up in the data warehouse.


While I agree about your core point (star schema ETL is great at most DW use cases due to simplicity and breadth of roles supported), I fail to understand how columnar stores are any more difficult to work with. My team routinely joins over trillions of records using standard SQL queries without any issue, all the map-reduce happens behind the scenes. In fact, it has been so successful for us that we have replaced almost all of our historical ETL infrastructure based on Spark with a single columnar store; while infrastructure costs have gone up (generally the price to pay for generalization), we have gained a great deal of human efficiency and DW maintainability.


I always felt the value of an OLAP cube was uncovering questions you may not know to ask. If you’ve got a cube with a good interface, it means more people in your organization can inspect the data looking for trends across dimensions that might not be obvious. As an example in a previous organization, someone just playing with an olap cube once found that certain DUI searches became much more common on Weekend mornings (perhaps obvious in retrospect). That insight allowed better targeting and sales, and made concrete what otherwise is just a guess.

Sometimes just exploring the data allows stuff like that to pop out.

Are there similar interfaces with columnar stores? Or do all the analytics need to be precomputed? The ability to slice/dice/filter/aggregate your data in perhaps non-obvious ways is really the value of business analytics, in my opinion.


I don’t think this benefit is unique to OLAP cubes. This is just generally the reason you have a data warehouse at all. The implementation is up to you but if you model the business process then you will be ready to answer questions that have not been asked.


But are there tools that allow non-developers to analyze the data ad hoc? If there's a gatekeeper in either direction, then it really slows down the process of understanding the data.

Cubes usually allow ad-hoc real-time queries that business users can play around with and explore.


That definition doesn’t align with my experience or perception. Maybe I am wrong. Microsoft definitely sells a suite of tools for creating OLAP Cubes and presenting the data.

What you are describing is just a data warehouse with a reporting tool. There’s dozens of options for building something like that.

A data warehouse is always “OLAP” but it isn’t often a “cube”.


The difference as I understand it is that a cube specifically is the multidimensional schema definition of your data set. So for instance typically in an OLTP system you’d have customer data, product data, sales data, performance data, etc. Usually that data is in separate tables.

OLAP is the combined methods that allow you to analyze the data, roll it up, aggregate it, combine it across dimensions etc. Usually by using a star schema.

An OLAP Cube, then, is the full multidimensional schema and relations of all the data. Typically a cube lets you define how each type of field is aggregated, whether it's summing count fields, averaging dollar fields, or applying special rules for date fields, etc. In addition, you specify hierarchies in the data, so for instance City belongs to State belongs to Country, etc.

Once it's all defined and the ETLs transfer your OLTP data into the OLAP system, the cube allows for complicated ad-hoc queries. An example is a spreadsheet-like cube explorer that lets you "slice", "dice", "drill", and "pivot" in real time.

https://en.wikipedia.org/wiki/OLAP_cube#Operations

The queries are run by the tool as you change the “cube spreadsheet” real time. Typically aggregates are precomputed to speed things up.

So as I understand columnar data systems, especially if we're talking about not defining a multidimensional cube schema, some database engineer needs to program each view of the data as requested by the consumers of the data.

So as I understood the article, columnar databases let you get away with all the slicing and dicing etc. in a performant way, thanks to the database architecture and speed, without defining the multidimensional cube schema upfront. The problem is, that seems to mean those definitions are just being created later by the query designers instead of upfront. I haven't seen, nor does the article mention, any tools that can replace the tools OLAP cube systems typically have for taking advantage of that predesigned cube schema.
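As I understand it, the columnar answer is that a query designer hand-writes something like this each time (names illustrative), rather than getting it from a predefined cube:

    SELECT
        country,
        state,
        city,
        SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY CUBE (country, state, city);  -- every roll-up combination in one query

Most modern warehouses will run that ad hoc just fine, but you still need to know the schema and SQL to write it, which is exactly the self-service gap.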

Here's a single example, called JPivot, in action. This is running MDX queries against the cube as it's manipulated:

https://youtu.be/dLccogT_phY


Whatever approach you take, someone or something has to build a model of the data for end users to slice and dice.

The bit you're missing is that a "cube" usually refers to a separate data store and compute engine from the data warehouse, queried via a different language such as MDX, which typically means a proprietary tool is now required to get access. Usually this is also a subset of the data, i.e. the cube designer has to decide what data is not included in the cube.

The people saying they don't see the value of cubes are essentially saying they prefer multidimensional analysis directly on top of a different stack: a relational, OLAP, columnar data warehouse, queried via SQL, with access to all the data.

It looks like the article is written by a vendor of such tools and hosted on their website, and others are mentioned throughout the thread.


>If you’re running a query like “give me the average price of all transactions over the past 5 years”, a relational database would have to load all the rows from the previous 5 years even though it merely wants to aggregate the price field; a columnar database would only have to examine one column — the price column.

Wouldn't that be two columns?


> Wouldn't that be two columns?

Columnar databases can also have indices. If there is an index by date (or time) then your DB will know the row range from the index and will read only the price column within that range. If there is no index by date it would be two columns, but that is still much less than a full row with many columns.


Yeah but it's nitpicky :)

Maybe "past 5 years" mean "over everything" and then it might technically be one column.

Or if you're using something like S3 with Parquet for "columnar database" and it is partitioned by date, then the date values are stored in metadata - so you would logically read two columns but physically only read one from storage. Same story for something like Redshift and using date as sortkey.


Oh, I'm definitely being nitpicky :D


If the number of transactions is small enough to query quickly, a simple SQL query will give you that. If the number of transactions is very large, querying a random sample of perhaps 1 in 1000 will give you reasonably accurate results. You can calculate the probable error from the sample size.

There are questions which require examining all the data, but often you don't really need to.
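For example, something like this (Postgres-style syntax; other engines spell sampling differently) gets you a close estimate for a fraction of the scan:

    -- Roughly a 0.1% Bernoulli sample; the standard error shrinks as the sample grows
    SELECT AVG(price) AS avg_price,
           COUNT(*)   AS sample_size
    FROM transactions TABLESAMPLE BERNOULLI (0.1)
    WHERE transaction_date >= CURRENT_DATE - INTERVAL '5 years';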


I remember on a dark gloomy day going to the bookstore and picking up a book about SQL server report server. Some of the toughest time in my life.


I just spent the first 6 months on a new reporting product ripping out SSRS because the engineers couldn’t get performance within 2 orders of magnitude of a native implementation on top of SQL Server. :(


One past thread:

The Rise and Fall of the OLAP Cube - https://news.ycombinator.com/item?id=22189178 - Jan 2020 (53 comments)


I never liked OLAP cubes because they discouraged ad-hoc data exploration since if there wasn't already a cube that modeled the data a certain way you couldn't do it. You had to develop it, or hope a DBA had time to develop it sometime in the next two months. They were fine if you already knew the questions you needed answered, but they probably contributed to some companies focussing on the wrong things because they couldn't easily explore data for new insights.

I much prefer the current setup I work under: We have an ODS with a whole bunch of denormalized views for the most commonly needed data built on incremental updates to a clone of the production database. If one of the views doesn't have the data required or it's not structured as required I can simply bring one of the cloned prod tables into my SQL query.

If I find myself repeating that process for the same data often, I create another view for it, but I always have all of the data needed for any analysis or exploration.

If I need up-to-the-second data I can write the boilerplate using ODS and cloned prod, then join to only live prod data required.
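In practice that looks something like this (names made up): start from a denormalized ODS view and join in a cloned prod table when the view doesn't carry what I need:

    SELECT
        v.order_id,
        v.customer_name,
        v.order_total,
        p.coupon_code                  -- not in the view, so pull it from the prod clone
    FROM ods.orders_denormalized v
    JOIN prod_clone.order_promotions p
      ON p.order_id = v.order_id;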

Things like the added complexity of constructing a pivot table in SQL are a non-issue. Report-writing software handles that, with a lot more flexibility for the end user to interact with the result.


This is why sometimes you end up with BOTH a traditional ETL data warehouse, for common, structured reporting and analytics and also an unstructured data lake(s) that pull raw data in from many sources to allow for ad-hoc exploration.

But like others have said here, eventually you might find a novel pattern or something worth tracking on a regular, go-forward basis, and will want it to built into the data warehouse, for all of the various benefits a data warehouse provides for everything else it does so well.


At a previous job, on a human ads evaluation team, we had a big row-oriented table that got turned into a column-oriented one every couple of hours. I think it got close to a million columns (expanded from leaf protobuf fields). We had some black/whitelisting as the schema grew too big (~20GB? just the schema, and the limits were a few gigs). That was years ago, so it has probably changed a lot.

But the use case was for various analysts (statisticians, linguists, etc.) and any other internal teams using the table. Their queries might use any column they liked - there was probably a way to track down which columns got used over time, but since the table was used for launching new things, you would not know in advance - better safe than sorry, I guess.

Anyway, I'm still fascinated that this terabyte-sized thing got converted using 20k machines for a few hours and worked! (It wasn't streaming; it converted everything over and over from the beginning. This might have changed too!)


Brings up nostalgic memories from my previous life building distributed analytics systems. We had a fully distributed, scalable, real-time OLAP cube 10 years ago on top of HBase and Storm (or any other realtime processing pipeline). It was less than 30 KLOC and could be used as a library as well as fully distributed system.

There are several OSS solutions but I still weep every time I think about the stupid reasons for which we delayed open-sourcing it. It was such a simple and scalable data model including many concepts that were not common at the time. https://medium.com/hstackdotorg/hbasecon-low-latency-olap-wi...

I still believe there's room for a similar simple approach, likely taking advantage of some of the developments related to Algebird, Calcite, Gandiva, etc.


Related question; does anyone use JVM-based OLAP that is not based on Mondrian (https://github.com/pentaho/mondrian) but still supports MDX queries? We've been looking for alternatives for a while.


I've been doing OLAP for over 20 years, so I can say with authority that OLAP is generally a bag of hurt. The tooling sucks, the languages suck, the costs suck. The name itself harks back to seemingly a time before computers. "Online"? As opposed to paper tablets?


It's far less about what "online" is contrasted with than about analytic processing contrasted with transaction processing.

OLAP was a departure from OLTP, online transaction processing, the mainstay of business transactional networks. And of course, both are contrasted to batch processing which occurs later or after the fact, though in practice, much OLAP was specific triggered reports or ad hoc analysis, at least from the world as I saw it.

Where OLTP and OLAP differ is in that OLTP deals with a transaction at a time, typically following a single entity or account through a single transaction or operation. The data loads are small though the transaction counts are high, and there's virtually no summation of information. Database optimisation is built around these needs.

OLAP, in theory, was based around realtime or near-realtime access to transactional data, but the analysis looked across entities (typically accounts, products, divisions, regions, time, etc.), and relied heavily on summarised data (counts, sums, means, medians, min, max, standard deviation, occasionally other metrics). This was typically rationalised by either defining views, or (more often in my experience) by precomputing what were thought to be sufficient and useful summary statistics. (In practice, neither ambition ever seemed to be reliably met, requiring returning to source or disaggregated data.)

From a statistical standpoint, one of the chief failures of OLAP is that it attempts to substitute summarisation for sampling. In practice, a small sample of a very large dataset is quite frequently orders of magnitude more easily compiled and analysed, with very little loss of information. Even aggregate computations can often be closely estimated by this method.

The article explicitly makes most of these points, FWIW.


Online as opposed to batch.


I know. But still, you must agree, it's a stupid name. The rest of the IT industry would say "real-time vs batch". And OLAP is batch in most implementations.


Real-time is engine control software not a webpage with a spinner.


The name was coined way before "online" meant what it means today.


OLAP is legacy hell. I built an MDX query to solve a report problem, but now I can't replace it because my seniors failed to deconstruct it because it is stupid difficult to wrap your head around. So now it sits in limbo until I can find time to do the work myself. Let me just take a look at my calendar and find my next free week... oh, there is none.


I keep feeling like there is room for a better language than MDX... and DAX isn't it either.


MDX is brilliant but challenging - especially if you don't work with it on a regular basis.


That's my experience for sure. I haven't thought all that hard about this, but I wish there were a language that lessened the wall of incomprehension I face every infrequent time I go back to MDX -- that's what I mean by room for another language.


A great deal of effort by very clever people went into MDX. Well, one clever individual anyway - Mosha Pasumansky - whom I once had the pleasure of meeting. He moved on to working on BigQuery at Google - so yeah, MDX is dead.


I'm not a big fan of OLAP cubes / data warehouses, but it should be noted that these aren't always just about scale and speed. They also serve to provide a single view of the data when the data is held across multiple databases / systems. Your orders may be in PostgreSQL, your inventory tracked in MySQL, etc. So bringing all that data into a single database for analytics can be valuable.


This isn't an argument against data warehouses, just against the cubes. Bringing everything into a single database for analysis is still required. But you extract and load, then transform, instead of extract, transform, load.


While that's true, it feels to me that your comment made this point a lot more clearly than the article did.


Which is why hadoop took off?


I am not a subject matter expert, but I spent the first few years of my career doing data warehousing. I think OLAP cubes are falling out of fashion because traditional BI, and reporting in general, are falling out of fashion.

With self-service BI tools like Looker, business folks are running more ad-hoc queries instead of relying on data traditionally piped through ETL processes.


One thing about columnar stores that annoys me is that when you do want to retrieve the full records for something you've found, it's really slow. I wish there were something of a hybrid.


We have something similar to that at SingleStore (formerly MemSQL). We support secondary keys and compression schemes that are incremental, so you can decompress a few rows out of our columnstore without needing to decompress the data for other rows stored nearby, like a traditional columnstore (Redshift, Snowflake, etc.) would. We call this subsegment access [1].

[1] https://www.singlestore.com/blog/memsql-singlestore-then-the...


I'm pretty sure Snowflake is a hybrid like you're describing. It certainly performs like it. Check out their original white paper for the architecture details.


My first exposure to OLAP was on a team developing a front end to Essbase that semi-automated report generation after reloading the cube. Since then, I have worked with various OLAP technologies from Microsoft to Mondrian.

I agree with Mr. Chin, the author of this blog. OLAP cubes can scale up but not out, which limits their potential size, so they can't really accommodate big data scenarios. My previous company used Apache Druid instead. Druid can do a lot of what OLAP did, but using a distributed approach. Although Druid does not provide tight integration with spreadsheets, it does provide near real-time ingestion, on-the-fly rollup, and first-class time series analysis.


I think the article misses a big point: parametric queries.

Cubes, whether tabular (columnar database) or multidimensional, provide a language, MDX or DAX, that basically allows parametric measures (or queries),

which is essential in data analysis.

SQL, for now, doesn't allow parametric functions or queries in the same way DAX or MDX does.


What about the MDX language - is it still relevant today? What are you using as OLAP tools?


We still use MDX. We did some testing with DAX, but I don't believe it's a step in the right direction. As for tools, we mostly use in-house tools built either on Mondrian or connecting to MS Analysis Services through XMLA, depending on the use case.


One big table combining all of your facts and dimensions for me.

Less compute time, less time for business users since they don’t need to figure out star schema joins. Storage is cheap.


I would say yes it's now possible to do analytics without cubes, except when butting up against two things:

* Really big data (~ PB/day), which makes it impractical to process every analytics query by going back to the source

* User data subject to GDPR-like retention constraints, where you usually cannot hold on to the base data for as long as you need the analytics


I'm not sure we can call NoSQL "a fad that has gone away".


I think the original author was referring to the wave of popularity a few years ago when Mongo / Redis / Cassandra / etc. were getting tons of press, "SQL is dead" was the going phrase, and document pre-joins and intense searching of objects/documents via map-reduce (Elastic et al.) were the metaphor du jour.

Like every other hype, it settled down to a realistic set of use cases that NoSQL databases rock at, and the world recognized that other use cases weren't their thing, and you stopped hearing about the death of SQL, the expectation that you'd never need a SQL DB or join again, etc.

So, not a fad, but past the hype-beyond-belief stage and now into the "gets stuff done" stage.


OLAP is not SQL or NoSQL in any way, shape, or form. Completely separate concept and completely separate technologies.


Indeed. Yet TFA insists on referring to NoSQL as "a fad" in its opening.



