
What is Qlik In-Memory Processing?

Author: Ryuma Nakano


Introduction

A computer relies on many hardware components to do its job, such as the processor, power supply, video card, RAM, hard drive, motherboard, network card, audio interface, and many more. Each of these components has its own specifications and provides specific services when we turn on and work with our computer. 

In this case, we will review the storage systems a computer needs in order to function and to let us perform data analysis tasks optimally: cache memory, ROM (Read-Only Memory), RAM (Random Access Memory), and secondary storage. 

Once we understand these structures and their basic functions, we will seek to understand what in-memory processing is and its implications in analytics. 


Types of Memory 

By definition, a memory is any device whose main task is to store data to be accessed by external tasks or processes. 


Cache Memory 

This is volatile memory, meaning that everything stored in it is erased when the computer loses power. It is used directly by the processor to hold the data and instructions it needs to access quickly while executing its processes. Its great advantage is response speed; its major disadvantages are its high acquisition cost, the small amount of data it can hold, and the fact that it cannot be expanded over time. 




ROM (Read-Only Memory)

This is permanent memory and tends to change very little over time. In this type of memory, vital configurations that do not need to be modified constantly are stored, such as the BIOS, firmware, and others. In our particular case of data analysis, this memory does not impact us except for the configurations dictated by the manufacturer. 


Random Access Memory (RAM) 

This is volatile memory that the computer uses to load and process all the software we use on it, from the operating system to image processing software, data storage systems, programming software, and everything that allows us to interact with and leverage its computing capabilities. In the past, before the 1990s, this type of memory used to have limited capacity and high cost. However, today these limitations have significantly decreased, and we depend more on the physical space and configurations of our machines to be able to increase its size.  

Here are the specifications of a Crucial DDR4 32GB RAM (2 x 16GB):

As we can see, the storage capacity is thousands of times greater compared to Cache memory; however, its speed is also thousands of times slower.


Secondary Storage  

This is the most common type of memory, as it is used daily to store all data, files, images, videos, and more. This includes memory cards (SD, microSD, etc.), USB/Flash drives, optical disc drives (CD, DVD, Blu-Ray), cloud storage, hard drives (HDD), and solid-state drives (SSD). These types of memory offer the largest storage capacities on the market in relation to the cost invested.  

We will use the specifications of a Kingston KC2500 1TB SSD, since SSDs offer the best performance among the types of secondary storage.



As we can see, if we compare the SSD's storage capacity with RAM, it is thousands of times greater or even more; however, its speed is also thousands of times slower. 


In-Memory Processing

In-memory processing refers to the ability of Qlik Sense / QlikView to load data optimally, meaning they can work with larger volumes of data while using less RAM. In the past, this processing was done directly on the hard drive, which had its advantages, such as handling volumes that were "high" for the time (the 1990s), at the cost of slower query speeds, as we saw in the previous section.  

Initially, many manufacturers believed that in-memory processing simply meant loading data in the same form in which it was stored on the hard drive. They did not account for the large difference in capacity between RAM and disk, which led to RAM demands that the systems of the time could not meet.


Data and Storage

Computers are very good at performing mathematical operations and following instructions (algorithms); however, the alphabet they use to communicate is not similar to the one we use. They use a binary alphabet, meaning they only understand zeros and ones, and with that, they interpret everything we try to tell them. 

So, let's think of each zero (0) and one (1) as having a unit weight within our memories, which we will call a bit. In other words, binary 0 weighs 1 bit, binary 1 weighs 1 bit, binary 10 weighs 2 bits, binary 01010010 weighs 8 bits, and so on. Just as the metric system has a standard for interpreting distances, the binary system also has one:

Unit               Equivalence
1 bit (b)          0 or 1
1 Byte (B)         8 bits (b)
1 KiloByte (KB)    1024 Bytes (B)
1 MegaByte (MB)    1024 KiloBytes (KB)
1 GigaByte (GB)    1024 MegaBytes (MB)
1 TeraByte (TB)    1024 GigaBytes (GB)
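
To relate these units back to the bit weights described above, here is a minimal Python sketch; the helper name weight_in_bits is just illustrative:

    # The weight of a binary string is simply its number of digits (bits).
    def weight_in_bits(binary_string):
        return len(binary_string)

    print(weight_in_bits("0"))         # 1 bit
    print(weight_in_bits("01010010"))  # 8 bits, i.e. 1 Byte

    # Unit equivalences from the table, expressed in bits.
    BYTE = 8
    KILOBYTE = 1024 * BYTE
    MEGABYTE = 1024 * KILOBYTE
    print(MEGABYTE // BYTE)            # 1 MegaByte = 1,048,576 Bytes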


Now that we understand how a computer stores its binary data, how does it interpret data written by a human? Let's define data as a numerical value or a sequence of characters: letters, special symbols, spaces, and numbers. 

Let's start by reviewing numerical data. This is fairly easy to understand: converting it is simply a matter of going from our decimal system, which, as its name implies, is base 10, to a base 2 system. Let's look at an example.
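
As a concrete illustration, here is a minimal Python sketch of the conversion; the variable names are only for readability:

    # 155 = 128 + 16 + 8 + 2 + 1 = 2^7 + 2^4 + 2^3 + 2^1 + 2^0
    decimal_value = 155
    binary_value = format(decimal_value, "b")   # '10011011'
    bit_weight = len(binary_value)              # 8 bits, i.e. 1 Byte
    print(decimal_value, "->", binary_value, "->", bit_weight, "bits")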


This means that the decimal number 155 is equal to the binary number 10011011, and its weight in memory is 8 bits, which is equivalent to 1 Byte.  

In the case of characters, each letter is assigned a number in decimal base, and this is stored as its binary equivalent. 

Letter    # Decimal    # Binary
a         1            1
b         2            10
…         …            …
x         24           11000
y         25           11001
z         26           11010

According to this example, we could define a standard for writing with a lowercase alphabet in which each letter weighs between 1 bit and 5 bits. However, this creates ambiguity: if we receive the binary 11010, we might read it as the letter 'z', or as the combination of binaries 1 10 10, which would correspond to 'abb'. For this reason, we define that, in our standard, every character must weigh exactly 5 bits, so our second case would be written as 00001 00010 00010.  
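
Here is a minimal Python sketch of this fixed-width idea; the 5-bit alphabet is an illustrative assumption, not a real encoding standard:

    # Assign a = 1, b = 2, ..., z = 26 and pad every code to 5 bits,
    # so a stream of codes can be split back without ambiguity.
    codes = {chr(ord("a") + i): format(i + 1, "05b") for i in range(26)}

    def encode(text):
        return " ".join(codes[letter] for letter in text)

    print(codes["z"])     # 11010
    print(encode("abb"))  # 00001 00010 00010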

In reality, these standards already exist, and we will use the 8-bit ASCII standard, which defines that each character has a weight of 1 Byte (8 Bits). Let's think of a column in a database table that we will call Color. 

Color     Weight in Bytes
RED       3 Bytes
YELLOW    6 Bytes
YELLOW    6 Bytes
YELLOW    6 Bytes
BLUE      4 Bytes
RED       3 Bytes

We can see that some colors are repeated and that each repetition weighs the same as the original value. To calculate the weight of this column, we simply add up all its weights: 3 + 6 + 6 + 6 + 4 + 3 = 28 Bytes, that is, 224 bits.  
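
A minimal Python sketch of this calculation, assuming the 8-bit ASCII weights defined above:

    # Each character weighs 1 Byte (8 bits) under 8-bit ASCII.
    color_column = ["RED", "YELLOW", "YELLOW", "YELLOW", "BLUE", "RED"]
    weight_in_bytes = sum(len(value) for value in color_column)  # 28 Bytes
    weight_in_bits = weight_in_bytes * 8                         # 224 bits
    print(weight_in_bytes, "Bytes =", weight_in_bits, "bits")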


Indexing

Now that we understand how a column is stored in memory, let's analyze how Qlik changed this paradigm. Note that this example ignores certain real-world constraints that further increase the weight of these values, such as database data types (Varchar, Varchar2), data volumes, and others. 

Indexing is the act of creating an index, or numerical identifier, for each value so that it can be found more efficiently since, as we saw, numbers are easier to handle than character strings. 

Qlik solved this problem using binary indexing. Each column loaded in QlikView, Qlik Sense, or their QVD, QVF, or QVW files, such as the Color column from the previous example, generates a master table with the distinct values (YELLOW, BLUE, RED) and a binary index whose size depends on the number of distinct values.

Idx_Color    Desc_Color    Weight
00           YELLOW        2 bits + 6 Bytes
01           BLUE          2 bits + 4 Bytes
10           RED           2 bits + 3 Bytes

Once this process is completed, the original column replaces the values it initially had with the new binary indexes created: 

Color    Weight
10       2 bits
00       2 bits
00       2 bits
00       2 bits
01       2 bits
10       2 bits

With these values, we can calculate the total weight of our new Color column: the master table weighs 3 x 2 bits = 6 bits of indexes plus 6 + 4 + 3 = 13 Bytes (104 bits) of distinct values, for a total of 110 bits, and the re-encoded column weighs 6 x 2 bits = 12 bits, giving 122 bits overall. 

As we can see, despite the additional data structures and processes, in this example we went from using 224 bits in the original case to 122 bits in the indexed case, a saving of roughly 45% in memory space. Scaled to large volumes of data, this is a significant improvement.
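
The whole exercise can be reproduced with a minimal Python sketch. It only illustrates the counting logic used in this example, not Qlik's actual internal implementation:

    import math

    color_column = ["RED", "YELLOW", "YELLOW", "YELLOW", "BLUE", "RED"]

    # Original storage: every value written out in 8-bit ASCII.
    original_bits = sum(len(value) for value in color_column) * 8        # 224 bits

    # Master table: one row per distinct value, holding a fixed-width
    # binary index plus the value itself in 8-bit ASCII.
    distinct_values = list(dict.fromkeys(color_column))                  # RED, YELLOW, BLUE
    index_width = max(1, math.ceil(math.log2(len(distinct_values))))     # 2 bits for 3 values
    master_bits = sum(index_width + len(v) * 8 for v in distinct_values) # 110 bits

    # Re-encoded column: each row now stores only its 2-bit index.
    column_bits = len(color_column) * index_width                        # 12 bits

    indexed_bits = master_bits + column_bits                             # 122 bits
    savings = 1 - indexed_bits / original_bits                           # ~45%
    print(original_bits, "bits ->", indexed_bits, "bits,", f"{savings:.1%} saved")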


Conclusion 

From this exercise, we can see that the more repeated values a column has, the better the data compression. Conversely, if all the values in a column are different, they will very likely weigh slightly more than in their original source.

What did you think? Share your opinion now!