I’ve been doing a lot of scientific computing recently. Nearly every experiment I run these days involves drawing conclusions from multiple timestamped parameters. Just this week, I was handed one such dataset on the order of 1.3 billion rows (66 GB of raw data). This is a good size dataset, so it’s not the type of dataset you want stored over a collection of comma-delimited .csv files. Whether you’re using MATLAB, Python, R, SAS, SPSS, etc. to perform your analysis, this is the type of dataset that absolutely necessitates the use of a back-end database.
When I say back-end database, I’m referring to any relational database. These types of databases all lend themselves to the use of the structured query language (SQL) that’s so instrumental in manipulating data. As such, there aren’t gigantic differences from one database to the other. Some are more efficient for certain tasks (if you’ve heard of Hadoop’s HBase, it was created specifically for dealing with extremely large sets of data), but since we’re just interested in a data analytics back-end, and not a live production server, the open-source MySQL solution is perfectly suited for our needs.
If you follow the documentation at mysql.com for getting this database installed on a Windows machine, it can be a bit daunting. Luckily, there’s a popular French project known as WAMPServer (Windows, Apache, MySQL, and PHP) that packages these three technologies together in one convenient pre-configured installation. I’ve found it to be the most convenient way of getting MySQL onto a Windows machine. At the time of writing, the most current version of WAMPServer is 2.5. I won’t go through the installation since it’s quite painless, but you can download the necessary files here (in either 32 or 64-bit flavor).
Once you have WAMP installed, depending on your security concerns and whether or not you want to be running a web-server, you may want to take a look at securing Apache or disabling it all together.
Optimizing MySQL’s Configuration
It’s helpful to keep in mind, at this point, that when you installed WAMP, you essentially installed a web-development platform; not technically something explicitly for scientific computing. When WAMP, in turn, installed MySQL, it installed it with the intention of using it within the Apache-MySQL-PHP stack to serve up webpages. That’s a much different use than as a back-end tool for data analytics. As such, there are a couple of tweaks we need to make.
Before I delve into the configuration details, I should take a couple of sentences to talk about database table design. There are several different engines used by MySQL (think of these engines just as different methods for storing and retrieving data), with the two most notable being MYISAM and InnoDB. For several reasons, MYISAM is a good engine to use for web services, but it’s not particularly ideal for extremely large tables. On the other hand, InnoDB is well-suited for this task due to its method of indexing and the fact it was developed specifically to deal with large tables. By default, MySQL employs the MYISAM engine, so we need to do a little work to get InnoDB up and running for scientific computing.
All of the configuration details for MySQL are stored within the my.ini file, which can be easily accessed through the WAMP icon on the toolbar.
Alternatively, you can open it manually from <installation-location<\wamp\bin\mysql\mysql5.6.17
When you open the file, you’ll notice that the very first lines in the file read “Example MySQL config file for medium systems…This is for a system with little memory (32M – 64M)…” Keywords there being medium system and little memory. I’m planning on using this back-end database to crunch through WAY more data than would be practical with just 64 MB of memory. Perhaps you’re starting to see why it’s important to customize the configuration?
Unfortunately, when WAMPServer 2.5 (and 2.4) was released, there was a small bug they forgot to address for the 64-bit version. Before we make any other changes to my.ini, if you have the 64-bit version installed, be sure to change [wampmysqld] to [wampmysqld64]! Otherwise, none of the changes made will be observed. Personally, I prefer to comment out (the hashtag ‘#’ is the comment character for this .ini file) original configuration details rather than delete them.
# The MySQL server
Amusingly, remember how I mentioned that WAMPServer is maintained by a French group? Well, even for the English release they managed to leave French as the default language. Assuming you’re familiar with English since you’re reading this post, make the change the below-highlighted code or else any of the error dialogues you encounter while using MySQL down the line will be in French. Believe me, that will be confusing.
# Change your locale here !
Scroll down a bit further to the line that reads ‘# Uncomment the following if you are using InnoDB tables’. In fact, that’s exactly what we want to do! There are a couple of key changes we want to make to the following paragraphs. First of all, the innodb_data_home_dir and innodb_log_group_home_dir variables should probably be directed to the same location as basedir and datadir around lines 37-39 (if you’ve already created some InnoDB tables before going through this process, these changes are crucial, otherwise your tables will be inaccessible). Furthermore, innodb_log_arch_dir is deprecated, so go ahead and remove that line entirely.
Note: If, after making the changes described below, your server won’t start and the errorlog reads: [ERROR] InnoDB: File C…\ibdata1: ‘create’ returned OS error 223. Cannot continue operation, be sure that your directory paths are described with forward slashes (/).
# Uncomment the following if you are using InnoDB tables
#innodb_data_home_dir = C:\mysql\data/
#innodb_data_file_path = ibdata1:10M:autoextend
#innodb_log_group_home_dir = C:\mysql\data/
#innodb_log_arch_dir = C:\mysql\data/
innodb_data_home_dir = C:/wamp/bin/mysql/mysql5.6.17/data
innodb_data_file_path = ibdata1:10M:autoextend
innodb_log_group_home_dir = C:/wamp/bin/mysql/mysql5.6.17/data
Here’s where we really get into some of the performance-defining specifics that are directly applicable to scientific computing. Without going into too much detail, the innodb_buffer_pool_size variable specifies exactly how much memory to allocate for caching indexes and tables. For this sort of dedicated back-end, there’s absolutely no need to limit it to a paltry 16 MB. The larger this allocation, the less disk I/O is needed, which ultimately speeds up table access. The comments within the file recommend 50-80% of your system’s RAM, but I found that 50% (of my 16 GB system) was adequate for my needs. Similarly, jog up the innodb_log_file_size variable. Despite the recommendation, I went with 500 MB. Anything over 1 GB didn’t want to initialize in my case.
# You can set .._buffer_pool_size up to 50 – 80 %
# of RAM but beware of setting memory usage too high
#innodb_buffer_pool_size = 16M
innodb_buffer_pool_size = 8450M
# Set .._log_file_size to 25 % of buffer pool size
innodb_log_file_size = 500M
innodb_log_buffer_size = 64M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50
After making these changes, you’ll want to save the file and restart MySQL. That can be accomplished through the WAMP toolbar menu.
If you want to make sure the changes you’ve made have been initialized, open the MySQL console (link is also found through the WAMPServer toolbar menu) and run the following command:
show global variables like ‘innodb_buffer_pool_size’
If everything went smoothly, you’ll see an allocated buffer size of much more than the original 16 MB!
There you have it, that’s all it takes to actually make use of your computer’s resources when using WAMP’s MySQL server for scientific computing! Be sure to employ the InnoDB engine for this sort of work. My next post will discuss, in detail, some of the considerations that should go into InnoDB table design to make sure queries are served quickly.