The Price of Data Coherency and the Cost of Simulated Joins (Part 1)

In the last few posts, I compared the insertion performance of MongoDB, Memcached, and MySQL under sequential, bulk, threaded, and parallel operations. MongoDB had superior performance, but it comes at a price: it doesn't support table locking, which can cause many headaches for people who use it in a distributed system, and it doesn't support joins. So what is the price of data coherency, and what is the cost of a simulated join in MongoDB compared to MySQL?

Data coherency is similar to cache coherency: in a non-ACID-compliant database, the data a query sees can be out of sync with what has actually been written. For example, suppose a process sends an entry X to the MongoDB server. A few microseconds later, a different process asks whether entry X is in the database, receives nothing in return, and inserts the same entry again. How often does this occur? Because of this problem, data-critical transaction systems such as financial banking stay away from non-ACID-compliant databases for very good reasons.

The second issue is that MongoDB doesn't support joins, so complex queries across collections cannot be expressed in a few lines of SQL. Joins are powerful operations that combine rows across multiple tables with optimized efficiency. One could import the data from MongoDB into MySQL and join it there, or write the join by hand in a scripting language like Python. How much slower is a simulated join in Python compared to an optimized join in MySQL?
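To make the "simulated join" concrete, here is a minimal sketch of what it looks like in Python: a hash join over two lists of records that have already been fetched from MongoDB. The collection and field names are made up for illustration; in MySQL the same operation would be a single JOIN.

# Simulated inner join on user_id between two result sets already pulled
# from MongoDB into Python lists (names are hypothetical).
def hash_join(left_rows, right_rows, key):
    index = {}
    for row in right_rows:                       # build a hash index on the key
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left_rows:                        # probe the index
        for match in index.get(row[key], []):
            combined = dict(row)
            combined.update(match)
            joined.append(combined)
    return joined

users  = [{"user_id": 1, "name": "alice"}, {"user_id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 25.0}, {"user_id": 1, "total": 7.5}]
print(hash_join(users, orders, "user_id"))
# MySQL equivalent: SELECT * FROM users JOIN orders USING (user_id);

Part of the cost is that the Python version has to pull both collections down to the client before it can join them, while MySQL joins in place with the help of its indexes.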

To answer the first question, I designed the following experiment. 54 processes attempt to insert n entries sequentially into the MongoDB server. Each process has a buffer of the same n entries in the same order. An entry consists of a key and randomly generated data with a fixed size of y characters. Before a process inserts an entry, it asks whether the key already exists; if it doesn't, the process goes ahead and inserts it. In the table below, the columns give n, the number of entries each process has to insert; the rows give the size of each entry in characters; and each cell is the number of duplicates produced by doing this parallel insertion in a non-ACID-compliant database.

| Size of Entry \ Total # of Entries | 1024 | 2048 | 4096 |
|------------------------------------|------|------|------|
| 1024                               | 589  | 685  | 1099 |
| 2048                               | 634  | 922  | 461  |
| 4096                               | 562  | 430  | 853  |
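The duplicates come from the unsynchronized check-then-insert that each process performs. A minimal sketch of that loop with pymongo (the host, database, collection, and field names are assumptions):

# Each process runs this loop over the same n entries in the same order.
import os
from pymongo import MongoClient

entries = MongoClient("dbserver", 27017)["bench"]["entries"]

def insert_if_absent(key, payload):
    # The race: between find_one() and insert_one(), another process can
    # insert the same key, which is exactly what produces the duplicates.
    if entries.find_one({"key": key}) is None:
        entries.insert_one({"key": key, "data": payload})

for i in range(1024):                                  # n entries
    insert_if_absent(str(i), os.urandom(512).hex())    # fixed-size random data

A unique index on the key (or an upsert) would push the existence check into the server and close most of this window.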

Fortunately, as the number of entries grows exponentially, the number of duplicates grows only polynomially. Surprisingly, the number of duplicates does not clearly grow with the size of each entry, even though I expected larger entries to take longer to insert and therefore leave a bigger window for the non-atomic check-then-insert to race.

(To be continued)

Parallel and Threaded Insertions – MySQL, MongoDB, and Memcached Performance (Part 3)

This post is a continuation of comparing the insert performance of MySQL, MongoDB, and Memcached. In the last two posts [1] [2], I compared how these storage engines would perform under sequential, bulk, and delayed inserts. In this post, I will continue the same experiment for threaded and parallel inserts under stressed conditions.

For many data-intensive applications, one needs a storage engine that can handle multiple requests simultaneously; an example is an architecture with many processors using centralized storage to fetch and store data. Memcached was designed to reduce the workload of a database server by caching query results in a memory cluster. MySQL was designed for structured, relational data, and MongoDB sits somewhere in between.

Threads are lightweight processes that execute concurrently and share resources such as memory, I/O, and data structures. Processes, on the other hand, are their own logical computing structures that own virtual and physical resources such as sockets, file handles, memory, and so on. This matters for parallel and threaded inserts because most modern data engines limit the maximum number of open connections; the defaults are around 100, 16, and 1024 for MySQL, MongoDB, and Memcached respectively. These numbers don't mean much on their own, since one can easily tune them in the configuration files to increase performance.

What's important is the physical limit on the number of connections and the overhead of context switching. First, every connection requires a number of file descriptors, which are capped by the operating system; on a high-end server the cap could range from 50K to 100K. Second, each open connection corresponds to a process in execution. With too many processes executing, context switching becomes computationally expensive, on top of the cost of flushing caches. In the extreme case, the operating system could spend all of its time switching processes on and off the processors rather than letting them do their work.

I use mpi4py to schedule processes across the cluster, together with Python threads, a higher-level interface on top of kernel threads. Below is a rough sketch of the driver, followed by the results of inserting the rows and columns of matrices into each of the following storage engines:
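The sketch uses the MongoDB client; the host, database, and collection names, the thread count, and the matrix size are assumptions:

import threading
from mpi4py import MPI
from pymongo import MongoClient

rank = MPI.COMM_WORLD.Get_rank()     # which MPI process this is
NUM_THREADS = 4                      # e.g. 2^2 threads per process
MATRIX_SIZE = 512

def worker(thread_id):
    collection = MongoClient("dbserver", 27017)["bench"]["pMatrix"]
    for i in range(MATRIX_SIZE):
        collection.insert_one({"proc": rank, "thread": thread_id, "i": i,
                               "row": list(range(MATRIX_SIZE))})

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()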

Parallel & Threaded Inserts in MongoDB with Sequential Insertions

Matrix-Size = 512 (avg. insert time in ms)

| Threads \ Processes | 2^4  | 2^5  | 2^6  |
|---------------------|------|------|------|
| 2^0                 | 0.11 | 0.11 | 0.09 |
| 2^2                 | 0.26 | 0.22 | 0.15 |
| 2^3                 | 0.14 | 0.15 | 0.15 |

Parallel & Threaded Inserts in MySQL using InnoDB for Higher Concurrency with Row Locks

Matrix-Size = 512 (avg. insert time in ms)

| Threads \ Processes | 2^4  | 2^5  | 2^6  |
|---------------------|------|------|------|
| 2^0                 | 5.72 | 5.37 | 5.35 |

Parallel & Threaded Inserts in Memcached with Sequential Insertions

Matrix-Size = 512 (avg. insert time in ms)

| Threads \ Processes | 2^4  | 2^5  | 2^6  |
|---------------------|------|------|------|
| 2^0                 | 0.19 | 0.19 | 0.19 |

First, it seems that bulk insertions give a bigger improvement than threaded insertions, which makes sense since the data store does not get jammed with too many concurrent requests. Second, more processes increase throughput because more of the 12-core server's resources are utilized. Third, I don't know what happened at threads = 2^2; I would expect the average time (ms) to insert a vector of length 2x(512) to be faster than with threads = 2^3.

For threaded insertions, I did not use bulk inserts, since I did not want to implement a local buffer for each thread to batch its inserts. One clear advantage of threads is that they can share a database connection. Instead of having hundreds of processes, which translates into hundreds of potentially inactive database connections, one could have threads share a connection. This clearly reduces overhead; how much of a performance gain it yields is another question. Unfortunately, the Python clients for Memcached and MySQL were not thread-safe, which I found out only after trying to implement this comparison.
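For MongoDB the sharing is straightforward, since a single pymongo client (with its connection pool) is safe to use from multiple threads. A minimal sketch, with the host and collection names assumed:

import threading
from pymongo import MongoClient

shared = MongoClient("dbserver", 27017)       # one client shared by all threads
collection = shared["bench"]["pMatrix"]

def worker(thread_id, n=512):
    for i in range(n):
        collection.insert_one({"thread": thread_id, "i": i})

threads = [threading.Thread(target=worker, args=(t,)) for t in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()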

Special note: threads in Python are designed to keep the CPU busy alongside asynchronous I/O. Because of the global interpreter lock, Python's threads cannot execute Python code in parallel across multiple cores or processors. I don't know whether they can take advantage of Intel's Hyper-Threading technology, but with AMD, all threads of a process share the same physical processor.

MySQL and MongoDB Performance Benchmarking (Part 2).

This weekly update continues the benchmarking of MySQL and MongoDB under various conditions. For readers who missed the first part, please take a look at that post for some background information and to understand my motivation.

Loading performance is important because loading data takes up a lot of time in many data-driven applications. For some applications, loading means taking data from a text file and storing it on disk. For others, it means taking data from disk and loading it into memory. For people who need to process a TB of data or more, storing it on disk, using NoSQL for unstructured data and traditional SQL for structured data, may be sufficient.

For smaller data, storing it in memory with Memcached is probably sufficient, assuming the data is also kept in a text file. With Memcached, one can easily edit and re-run an algorithm without having to reprocess and reload the data from the text file. How fast is storing data in Memcached compared to MySQL and MongoDB? Is it possible that MongoDB is actually faster when loading data through a client like Python?

With MySQL, I did no tuning in the configuration file, nor did I test each of its 10 or more storage engines. One reader suggested testing the MEMORY engine with a large query cache; I think it is a good idea. However, the MEMORY engine doesn't support BLOB/TEXT fields, and comparing a disk engine in MongoDB to a memory engine in MySQL seems unfair. Let's look at the results of the previous sequential insertion, with Memcached added.

 

| DBs \ Matrix Size        | 128 Avg. (ms) | 128 Total (s) | 256 Avg. (ms) | 256 Total (s) | 512 Avg. (ms) | 512 Total (s) |
|--------------------------|---------------|---------------|---------------|---------------|---------------|---------------|
| MySQL w/o modification   | 2.44          | 40            | 5.89          | 386           | 9.60          | 2517          |
| MongoDB                  | 0.18          | 3             | 0.31          | 20            | 0.53          | 140           |
| Memcached                | 1.22          | 20            | 1.53          | 100           | 2.28          | 597           |

 

Surprisingly, MongoDB is much faster than MySQL when inserting 2^14 records. How much faster? About 14 and 7 times faster compared to MySQL and Memcached respectively. One might ask how it is possible that MongoDB is faster than Memcached. When inserting data into Memcached, the least-recently-used bookkeeping has to be updated so that a victim can be selected for eviction once the cache reaches its memory limit. The trends for bulk insertion are the same as for sequential inserts. I will investigate how MongoDB loads data.

Bulk insertion is the process of inserting multiple entries into the database in one operation. For some clients, bulk insertion is just a wrapper around sequential insertion; for others, it can be implemented more optimally with knowledge of TCP and file buffers. I didn't investigate how these databases implement bulk insertion. Comparing the results, I noticed that bulk insertion is overall faster than sequential insertion for all of the databases, including Memcached.
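For reference, these are roughly the bulk-insert calls the three Python clients expose; the host names, table and collection names, and batch size are assumptions for illustration:

import MySQLdb
import memcache
from pymongo import MongoClient

docs = [{"k": str(i), "v": i} for i in range(1000)]

# MongoDB (pymongo): the whole batch goes to the server in one call
MongoClient("localhost", 27017)["bench"]["kv"].insert_many(docs)

# MySQL (MySQLdb): executemany reuses a single parameterized INSERT
conn = MySQLdb.connect(host="localhost", user="bench", passwd="bench", db="bench")
cur = conn.cursor()
cur.executemany("INSERT INTO kv (k, v) VALUES (%s, %s)",
                [(d["k"], d["v"]) for d in docs])
conn.commit()

# Memcached (python-memcached): set_multi sends many keys per request
memcache.Client(["127.0.0.1:11211"]).set_multi({d["k"]: d["v"] for d in docs})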

 

| DBs \ Matrix Size                            | 128 Avg. (ms) | 128 Total (s) | 256 Avg. (ms) | 256 Total (s) | 512 Avg. (ms) | 512 Total (s) |
|----------------------------------------------|---------------|---------------|---------------|---------------|---------------|---------------|
| MySQL w/ large query cache & delayed insert  | 2.01          | 33            | 3.60          | 236           | 6.41          | 1682          |
| MongoDB                                      | 0.12          | 2             | 0.21          | 14            | 0.42          | 110           |
| Memcached                                    | 0.48          | 8             | 0.92          | 60            | 1.65          | 433           |

 

Average Speedup Matrix for Sequential and Bulk Inserts:

The average speedup is defined as the average, over all cases, of the ratio of the times of database X and database Y. On average over the sequential and bulk inserts, MongoDB is about 17 times faster than MySQL and 5 times faster than Memcached. Memcached is on average about 4 times faster than MySQL.
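Concretely, taking the per-insert averages (ms) for the three matrix sizes from the sequential and bulk tables above, the ratios work out to roughly the numbers in the matrix below (this assumes the average is taken over those six cases):

mysql     = [2.44, 5.89, 9.60, 2.01, 3.60, 6.41]   # avg ms, sequential then bulk
memcached = [1.22, 1.53, 2.28, 0.48, 0.92, 1.65]
mongodb   = [0.18, 0.31, 0.53, 0.12, 0.21, 0.42]

def avg_speedup(slower, faster):
    return sum(s / f for s, f in zip(slower, faster)) / len(slower)

print("%.1f" % avg_speedup(mysql, mongodb))      # ~16.6  (MongoDB vs. MySQL)
print("%.1f" % avg_speedup(memcached, mongodb))  # ~4.7   (MongoDB vs. Memcached)
print("%.1f" % avg_speedup(mysql, memcached))    # ~3.7   (Memcached vs. MySQL)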

 

| Speedup (row vs. column) | MySQL | Memcached |
|--------------------------|-------|-----------|
| MongoDB                  | 17    | 5         |
| Memcached                | 4     | -         |

 

Multithreaded Insertion:

Next week, I will run a comparison using threads to do parallel inserts. I will also see whether it is possible to implement asynchronous inserts for MongoDB/MySQL/Memcached. In the near future, I will benchmark distributed (clustered) memory with Memcached and sharded data with MongoDB. I think I will skip MySQL Cluster, since setting it up is more painful than the results are worth.

Some additional notes.

1) The MEMORY engine in MySQL doesn't support BLOB/TEXT. The table definition is saved to disk, but a restart clears the rows.
2) MongoDB has a document size limit of 16MB. I wish there were an internal mechanism that could link two or more documents into one logical record, so the size limit wouldn't be a problem for general data-driven applications.
3) Memcached has a key size limit of 250 bytes, which is also a pain.
4) The Python clients for Memcached and MongoDB are easier to use than the one for MySQL, but only because of the additional SQL syntax.
5) Distributed Memcached and MongoDB aren't stable when inserting MBs of data; for instance, connections time out on large entries. In this comparison, I issue a bulk insert once I reach 10KB of data. Trying to bulk insert a MB of data will most likely cause a timeout. Note that the data isn't being transferred across the network.
6) InnoDB is a transactional engine, so it doesn't support delayed inserts.

MySQL and MongoDB Performance Comparison Part 1

One quick way to horizontally scale out computation is to use a database as a passive task scheduler that randomly hands unprocessed tasks to compute nodes. Given N processes and K tasks, where K >> N, collisions are minimal. For algorithms where K strictly decreases over time, this approach adds redundant computation once K is close to or smaller than N. For other algorithms, like distributed web crawling, one can argue that K remains much greater than N because new links are continually added to the processing queue.

I have been interested in finding a centralized database solution that can handle a large incoming workload of processes fetching unprocessed tasks. Two popular database solutions are MySQL and MongoDB: one is a good representative of the traditional RDBMS, and the other is a popular choice from the NoSQL movement. Which one is suitable for this task, and how much faster is it? How much tuning do I have to do to speed it up?

I'm using two algorithms to compare MySQL to MongoDB. The first is a plain matrix multiplication where two large matrices are stored in a table and processes randomly fetch rows and columns to multiply. The second is a simple distributed web crawler where links are tasks and processes fetch links to retrieve the corresponding HTML. The two problems differ in the amount of data being transferred: with the matrix multiplication, a task is inherently large and causes traffic congestion, while the distributed web crawler has little congestion because a link is relatively small.

Besides loading the database and creating appropriate indexes, I try not to change too much of the default configuration, to see how these two databases perform out of the box. The following is the workflow of the matrix multiplication and the distributed web crawler; a sketch of the worker loop follows the list.

1) Fetch an unprocessed task [row/col or URL]
2) Execute the task [multiply the row/col or fetch the HTML]
3) Store the result in the DB
4) Mark the task as processed.
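A minimal sketch of that worker loop for the matrix-multiplication case, using pymongo; the field names follow the schema below, while the host name and the random-offset fetch are assumptions (the actual queries appear in the Random Selection section):

import random
from pymongo import MongoClient

tasks = MongoClient("dbserver", 27017)["bench"]["pMatrix"]

def run_one_task():
    count = tasks.count_documents({"processed": 0})
    if count == 0:
        return False
    # 1) fetch a random unprocessed task
    cursor = tasks.find({"processed": 0}).skip(random.randint(0, count - 1)).limit(1)
    task = next(cursor, None)
    if task is None:
        return False
    # 2) execute the task (dot product of the stored row and column)
    value = sum(a * b for a, b in zip(task["row"], task["col"]))
    # 3) store the result and 4) mark the task as processed
    tasks.update_one({"_id": task["_id"]},
                     {"$set": {"value": value, "processed": 1}})
    return True

while run_one_task():
    pass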

All 9 machines I'm using are identical, each with 6 AMD 3.2GHz cores, for 54 cores total. The centralized DB server has 12 AMD cores and 64GB of RAM.

Schemas:

CREATE TABLE pMatrix (processed TINYINT(1), row MEDIUMTEXT, col MEDIUMTEXT, value INTEGER, i INTEGER, j INTEGER) ENGINE = InnoDB;

record = { 'processed' : boolean, 'row' : list of integers, 'col' : list of integers, 'value' : int, 'i' : int, 'j' : int }

Loading Performance

This measures how fast the data can be loaded into the database using the MySQLdb and PyMongo clients for Python. The first part measures sequential inserts on a local machine, without transferring the data across the network; the second part measures bulk inserts.
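A sketch of the timing harness behind the "Avg. (ms)" and "Total (s)" columns, shown for the sequential MongoDB case (the record count, host, and generated rows are placeholders):

import time
from pymongo import MongoClient

collection = MongoClient("localhost", 27017)["bench"]["pMatrix"]
NUM_RECORDS = 2 ** 14                     # placeholder for the real record count
per_insert = []

start = time.time()
for i in range(NUM_RECORDS):
    row = list(range(128))                # stand-in for a generated matrix row
    t0 = time.time()
    collection.insert_one({"processed": 0, "row": row, "col": row,
                           "value": 0, "i": i, "j": i})
    per_insert.append(time.time() - t0)

print("avg   %.2f ms" % (1000 * sum(per_insert) / len(per_insert)))
print("total %d s" % (time.time() - start))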

| DBs \ Matrix Size | 128 Avg. (ms) | 128 Total (s) | 256 Avg. (ms) | 256 Total (s) | 512 Avg. (ms) | 512 Total (s) |
|-------------------|---------------|---------------|---------------|---------------|---------------|---------------|
| MySQL             | 2.44          | 40            | 5.89          | 286           | 1.94          | 508           |
| MongoDB           | 0.18          | 3             | 0.31          | 20            | 0.53          | 140           |

Random Selection Performance

In this comparison, I measure the time it takes to fetch a random row in MySQL and a random record in MongoDB.

Rnum = random.randint(0, COUNT - 1)    # COUNT = number of unprocessed tasks

# MySQL: fetch one random unprocessed row/col pair
Qa = "SELECT row, col FROM pMatrix WHERE processed = 0 LIMIT 1 OFFSET %d" % Rnum

# MongoDB: fetch one random unprocessed record
Qb = collection.find({'processed': 0}).skip(Rnum).limit(1)

To be continued.

General Writing Advice in Science from my Adviser

Basically, scientific writing is a bit more dry: you want to minimize the number of terms you use and use them consistently, just the opposite of what poets do. You also try to be consistent and “minimalistic” in notation, using bands of letters consistently, e.g., n-q for integers and u-z for reals; if you use A-M for nodes, you do not then use a or c for nodes, etc. This simply helps the reader digest difficult text.

… try to keep the flow of thoughts from the general to the particular, or the opposite, but with no jumping between levels … try to address each issue only once in a section.

The most-read parts are the abstract, the intro/motivation, and the conclusions, so they warrant the most attention. They give you three chances to convince a reader that the paper is worth reading, so all three may contain the main point of the paper.

Distance impacts friendship.

I discussed a little bit about how distance impacts friendship on the RPI/CS blog; below is a small excerpt.

Distance impacts friendship in a fundamental way: being far apart prevents people from becoming friends, but being close together does not by itself create friendship. While any two people could become friends anywhere at any time, it becomes less likely as the distance between them increases. A personal example: most of my friends are from Boston, where I have lived most of my life. Almost paradoxically, proximity alone is not enough; I only happen to know a small fraction of the people who live in Boston. Location-based social-network data was used to support this hypothesis.

[Figures: left, x-y positions of mobility traces from the random waypoint model and from our friendship-based model; right, average distance apart (y-axis) versus check-in similarity, i.e., how often two people occur at the same time and place (x-axis).]

The left figure displays mobility traces of the random waypoint (RWP) model and of our proposed friendship-based mobility model. The gray lines are trajectories of the RWP model; the green lines are trajectories of our model, driven by real data from a location-based social network. Each trace is the trajectory of a particular node moving from one point to the next, represented by the lines. The two models differ in a fundamental way. The stationary distribution of positions along RWP trajectories is concentrated at the middle: if we randomly pick two points on the coordinate system, the line that connects them is likely to cross the center. The stationary distribution of movements in our friendship-based model instead reflects the fact that humans tend to stay around certain locations (e.g., home, school, work) and travel with lower probability to places outside their usual proximity (e.g., vacations, conferences, travel).

In the right figure, there are 701 blue points representing randomly selected pairs of users who are friends and 620 red points representing randomly selected pairs who are not friends. The shaded region is drawn using the k-nearest-neighbor algorithm to classify whether two users are friends given their average distance apart and their check-in similarity, i.e., how often the two users occur at the same time and place. From the graph, we argue that distance impacts friendship in a fundamental way. First, along the y-axis, as distance increases it becomes less likely that two randomly selected users are friends. Interestingly, along the x-axis, being nearby does not by itself create friendship. To conclude, long distances prevent people from being friends, but being close together does not by itself create friendship.
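For intuition, the shaded region comes from a classifier along these lines; the feature values below are hypothetical stand-ins, not the real dataset:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Each sample: [average distance apart (km), check-in similarity]; label 1 = friends.
X = np.array([[1.2, 0.30], [3.5, 0.10], [5.0, 0.20], [2.0, 0.05],
              [15.0, 0.02], [450.0, 0.01], [700.0, 0.02], [900.0, 0.00]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[10.0, 0.15]]))   # predicted label for a new pair of users

In practice the two features would be rescaled to comparable ranges before running k-NN, since raw distances in kilometers would otherwise dominate the similarity scores.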

Using Location-Based Social Networks to Simulate Friendship Mobility for Mobile Networks.

We have been developing a simple friendship mobility model that captures the essence of how a group of friends travels together or moves from one location to another. The idea is pretty simple, but the results are powerful.

First, we collected data from a location-based social network, specifically GoWalla. Second, we used a Markov model in which check-ins are states and the transition probabilities from one state to another are estimated empirically from the training dataset. GoWalla provides not only the mobility traces of individual users, through the process known as checking in, but also the friendship topology and other useful data. Each user therefore gets his or her own empirical Markov model, and the complete model is the collection of all of them. Once we have the complete Markov model, we use the Miller projection to convert latitude/longitude into a Cartesian coordinate system that approximately preserves distances.
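A minimal sketch of the per-user empirical Markov model; the check-in format (a time-ordered list of venue ids) is an assumption for illustration:

import random
from collections import defaultdict

def build_chain(checkins):
    # checkins: one user's venue ids in time order; consecutive pairs are transitions
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in zip(checkins, checkins[1:]):
        counts[src][dst] += 1
    chain = {}
    for src, dsts in counts.items():
        total = float(sum(dsts.values()))
        chain[src] = {dst: c / total for dst, c in dsts.items()}
    return chain

def walk(chain, start, steps):
    # sample a synthetic trajectory from the empirical transition probabilities
    state, path = start, [start]
    for _ in range(steps):
        if state not in chain:
            break
        venues, probs = zip(*chain[state].items())
        state = random.choices(venues, weights=probs)[0]
        path.append(state)
    return path

chain = build_chain(["home", "work", "cafe", "work", "home", "work"])
print(walk(chain, "home", 5))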


This is how a set of friends travels together. Implications? Huge. For instance, take the popular Random Waypoint Model that has dominated simulations in the networking research community and replace it with the Friendship Mobility Model. Or study the migration of a population during a natural disaster, e.g., the recent nuclear disaster in Japan. Enough said.

The paper, “Using Location-Based Social Networks to Simulate Human Mobility for Mobile Networks,” is currently under submission. More results will be presented, and the datasets and code will be made publicly available and disseminated under an agreement with the IRB at RPI, protocol #1125, “Data Infrastructure for Complex Social Networks.”

Of course, we aren't the only ones, nor the first, to study human mobility. Check out the excellent work of these folks: http://www.barabasilab.com/ and http://cs.stanford.edu/people/jure/