- When following this method on a remote server, use a remote file path.
- The full LOAD DATA INFILE syntax can be referenced here.
I’m working on a dataset that was emailed to me as a collection of xml files. Due to its structure, xml isn’t a very convenient format for holding scientific data. Because of that, I pre-processed the files and transformed them into a useful .csv’s. I initially considered using phpMyAdmin’s built-in importer tool to upload these files to my database, since I’ve had good success with the importer in the past. However, these csv files were large, and I had around 20 of them to upload. For as convenient as phpMyAdmin is on quick and easy jobs, it’s decidedly slower than shell commands when importing/exporting data. In an effort to save me a little time, I opened a command prompt and made use of LOAD DATA.
USE database;
LOAD DATA LOCAL INFILE 'C:/{redacted}/data.csv' INTO TABLE tbl_data FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
The thing I like about SQL is that its basic commands read like regular English. First, the database is selected, then the server is informed that data will be imported from a given location and placed into a given table. There are two further statements made that are dictated by how the data within the csv are arranged. Glancing at a screenshot of the data, each “cell” of the csv is defined by a comma, thus the fields are terminated by ‘,’. Text readers are programmed to ignore new-line characters, so no ‘\n’ cannot be seen on the screenshot, but MySQL will still identify the new lines by these regular breaks.
Manually Automating LOAD DATA
Manually automating is redundant. I know, but once the command is entered and run, the database is updated extremely quickly. I’m then able to hit the up key on my keyboard to recall the last command, then simply changing the filename to the next csv file! A batch script could be written, but I was able to upload all 20 files in less than 2 minutes with this process.
Windows vs. *nix Line Breaks
After some initial success, I ran into a unique error that led to a deeper understanding of LOAD DATA. My original data didn’t include any NULL values. However, the pre-processing I did resulted in another column and a few empty cells. This meant that the NULL values SQL was seeing were located within the far right column, right next to where the lines terminated.
This brought to light an error I’d made in my LINES TERMINATED BY statement. The fact is, only in *nix systems are lines terminated by ‘\n’. In windows systems, lines are terminated by ‘\r\n’.
Reading through the official documentation, it is explained as follows:
If you have generated the text file on a Windows system, you might have to use LINES TERMINATED BY ‘\r\n’ to read the file properly. Windows programs typically use two characters as a line terminator. Some programs, such as WordPad, might use \r as a line terminator when writing files. To read such files, use LINES TERMINATED BY ‘\r’.
That being said, the formatting of the .csv can be switched up by Notepad++ if you find it useful.