Interest in and development of in-memory technologies have increased over the last few years, driven in part by widespread availability of affordable 64-bit hardware and operating systems and the performance advantages in-memory operations provide over disk-based operations. Some software vendors, such as SAP, whose High-Performance Analytic Appliance (HANA) project has been advancing with momentum, have even suggested that we can put our entire analytic systems in memory.
I hope it will be helpful to take a look at what an “in-memory” system is, what it is good for and what some of the concerns about it are. First of all, nearly all systems involve some combination of memory and disk operations, but the roles each of these plays may differ. The fundamental value proposition relates to the greater speed of memory-based operations vs. disk-based input/output (I/O) operations. It is easy to understand that computer operations in memory can be significantly faster than any operation involving I/O. Many types of system performance have been enhanced by leveraging memory in the form of caches. If information can be retrieved from a cache rather than the disk, the operation will complete more quickly.
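To make the caching point concrete, here is a minimal sketch in Python. The `read_from_disk` function and the record keys are hypothetical stand-ins for a slow disk read; only `functools.lru_cache` is a real library call. It simply counts how many requests fall through to the simulated disk.

```python
from functools import lru_cache

disk_reads = 0  # counts how often we fall through to the (simulated) disk


def read_from_disk(key):
    """Hypothetical stand-in for a slow disk read, for illustration only."""
    global disk_reads
    disk_reads += 1
    return f"record-{key}"


@lru_cache(maxsize=1024)
def get_record(key):
    # Only reached on a cache miss; repeat requests are served from memory.
    return read_from_disk(key)


# Six requests, but only three distinct keys.
for key in [1, 2, 1, 1, 2, 3]:
    get_record(key)
```

In this toy run only the first request for each distinct key reaches the simulated disk; the repeats are answered from memory.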
What types of applications can benefit from in-memory technology? Very fast, high-volume transaction processing can be accomplished using one type of in-memory technology. Examples include IBM solidDB, Oracle TimesTen, Membase and VoltDB. Complex event processing (CEP) is another type of in-memory system. Examples of CEP include IBM, Progress Software’s Apama, Streambase, Sybase Aleri (recently bought by SAP) and Vitria. Other types of analytics can be performed in-memory, including more conventional query and analysis of historical data. Beyond SAP, examples include QlikView, which I recently assessed, Tibco Spotfire and now Tableau. All of these systems deal with historical data. Another category of in-memory systems involves forward-looking calculations, models and simulations. Examples include IBM Cognos TM1 and Quantrix, which my colleague recently covered (see “Quantrix Gets Pushy with Plans”).
Over the years database performance has been greatly improved by advances in caching schemes. A logical extension of caching might be to put the entire database in memory and eliminate any disk-based operations. Well, it’s not quite that simple. There are some complexities, such as recoverability, that must be dealt with when the system is entirely in memory. I suspect you’ve heard the term “ACID compliant”; the “D” stands for durability. It represents the notion that a transaction once committed will be durable or permanently recorded. Without creating a copy of the transaction somewhere other than in the memory of the affected system, you can’t recover the transaction and therefore cannot provide durability. Even in analytical systems the notion of durability is important because you need to be able to ensure that the data was loaded properly into the system.
I’ve seen three schemes for dealing with the durability issue. Each has advantages and challenges:
1) Write data to disk as well as putting it in memory. The challenge here is whether you can write to the disk fast enough to keep up with the data that is being loaded into memory.
2) Put the data in memory on two different machines. The risk here is that if both machines go down, you lose the data.
3) Use a combination of #1 and #2 above. Putting data in-memory on two machines provides some level of protection that allows time for a background process or asynchronous process to write data out to disk. In this case you need to understand what scheme a vendor is using and whether it meets your service level agreements.
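As a rough illustration of the first scheme only, here is a minimal sketch in Python. The `DurableStore` class, its log format and the sample keys are all hypothetical and do not represent any vendor's implementation. Each write is appended to a log file and forced to disk before the in-memory copy is updated, and a restart rebuilds memory by replaying the log.

```python
import json
import os
import tempfile


class DurableStore:
    """Hypothetical in-memory key-value store backed by an append-only log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        # Rebuild the in-memory state from the log after a restart.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                key, value = json.loads(line)
                self.data[key] = value

    def put(self, key, value):
        # Write to disk first, so an acknowledged write survives a crash.
        self._log.write(json.dumps([key, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.data[key] = value

    def get(self, key):
        return self.data[key]

    def close(self):
        self._log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "store.log")
    store = DurableStore(path)
    store.put("order-1", 250)
    store.put("order-2", 975)
    store.close()  # simulate the process going away

    recovered = DurableStore(path)  # memory is rebuilt by replaying the log
    result = dict(recovered.data)
    recovered.close()
    return result
```

The `os.fsync` call on every write is exactly where the challenge described in scheme 1 shows up: the store can accept data no faster than the disk can confirm it. Batching several writes per sync, or moving the sync to a background process as in scheme 3, trades some of that protection for speed.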
In some streaming applications the history and recoverability are left to other systems (such as the system of record) and the operations on the streaming data are allowed to operate “without a net,” so to speak. This method assumes that if the system goes down, you can live without the data that was in transit – either because it will be recovered elsewhere or because it wasn’t important enough to keep. An example might be stock-trading data being analyzed with an in-memory complex event processing system. If the CEP system crashes, the quotes being analyzed could be recovered from the exchange that generated them.
Another issue is that memory is much more expensive than spinning disk. In considering the enormous and ever-increasing volumes of data produced and consumed in the world of analytics, cost could be a significant obstacle. In the future, cost structures may change, but for the near term, memory still exacts a premium relative to the same quantity of disk storage. As a result memory-based systems need to be as efficient as possible and fit as much data as possible into memory. Toward this end, many in-memory analytic systems use columnar representation because it offers a compact representation of the data. Thus the key issue here as you compare vendors is to understand how much memory each requires to represent your data. Remember to take into consideration temp space or working space for each user.
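One reason a columnar layout can be compact is that the values in a single column tend to repeat, so each distinct value can be stored once and replaced by a small code. The sketch below shows this dictionary encoding on a made-up "region" column; the data, the function names and the one-byte code size are assumptions for illustration, and real products use their own (often more elaborate) encodings.

```python
from array import array


def dictionary_encode(values):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = []
    index = {}
    codes = array("B")  # one unsigned byte per row; assumes at most 256 distinct values
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def decode(dictionary, codes):
    """Recover the original column from the dictionary and the codes."""
    return [dictionary[c] for c in codes]


# Hypothetical column of 100,000 rows with only three distinct values.
regions = ["North America", "Europe", "Asia Pacific", "Europe"] * 25_000

dictionary, codes = dictionary_encode(regions)

# Compare the bytes needed for the raw text with the encoded form.
raw_bytes = sum(len(r.encode("utf-8")) for r in regions)
encoded_bytes = len(codes) * codes.itemsize + sum(
    len(d.encode("utf-8")) for d in dictionary
)
```

With these made-up values the encoded column needs roughly one-ninth of the bytes of the raw text. Your own ratio depends entirely on how repetitive your data is, which is why it is worth asking each vendor to size your actual data rather than relying on a generic compression figure.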
I think the technology market understands the value of accelerating DBMS and CEP operations, as we found in our benchmark research on Operational Intelligence and Complex Event Processing, but I doubt that it fully understands how in-memory technology can transform calculations, modeling and simulations over large amounts of data. Today’s CPUs are very good (and fast) at performing calculations, as millions of users know from their work with spreadsheets. An overwhelming majority (84%) of organizations in our benchmark research on BI and performance management said it is important or very important to add planning and forecasting to their efforts, and these activities are calculation-intensive.
Applying spreadsheet-style calculations to large amounts of data is a challenge. Often you run out of memory, or performance is so poor that the system is unusable. Relational databases are another obstacle. Performing spreadsheet-type calculations on data in relational databases is difficult because each row in an RDBMS is independent of every other row. In performing simple inter-row calculations – for example, computing next year’s projected sales as a function of this year’s sales – you could be accessing two entirely different portions of the database and therefore different portions of the disk. Multiply that simple example by the hundreds or thousands of formulas needed to model your business operations and you can see how you might have to have the whole database in-memory to get reasonable performance. Assuming you can fit the data in-memory, these types of seemingly random calculation dependencies can be handled easily and efficiently. Remember, RAM stands for random-access memory.
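The projected-sales example can be sketched in a few lines of Python. The products, years, sales figures and growth rates below are invented for illustration. Each projected cell depends on the prior year's cell for the same product, and because everything is held in memory that dependency is resolved with a direct lookup.

```python
# Hypothetical data: sales keyed by (product, year), held entirely in memory.
sales = {
    ("widgets", 2010): 1200.0,
    ("gadgets", 2010): 800.0,
}
growth = {"widgets": 0.10, "gadgets": 0.25}  # assumed annual growth rates


def project(sales, growth, from_year, years):
    """Fill in future years, each computed from the prior year's cell."""
    result = dict(sales)
    for product, rate in growth.items():
        for year in range(from_year + 1, from_year + years + 1):
            # Each lookup is a direct memory access to another "row".
            prior = result[(product, year - 1)]
            result[(product, year)] = round(prior * (1 + rate), 2)
    return result


projected = project(sales, growth, 2010, 2)
```

In a disk-based system each of those prior-year lookups could mean a separate read from a different part of the disk. In memory they are simple key lookups, which is what makes models with thousands of such dependencies practical.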
The next consideration is that most in-memory databases used for analytics do not scale across machines. SAP’s HANA may change some of that. Tibco’s ActiveSpaces has promise as well. I’ll be interested to see who tackles this challenge first, but I don’t believe the internode communications infrastructure exists yet to make random access of data across nodes feasible. So for the time being, calculation models will most likely need to be confined to data located on a single machine to deliver reasonable performance. It’s clear that in-memory databases can provide needed benefits, but they will have to handle these challenges before wide adoption becomes likely.
Let me know your thoughts or come and collaborate with me on Facebook, LinkedIn and Twitter.