Today I will enlighten you about the incredible fast load times which you will get with a Microsoft Parallel Data Warehouse!
The first thing you have to do when you work with a MSPDW is load your data into the PDW. This initial load will be the most “time consuming” so with this post I hope to help you avoid common mistakes which I made myself with the DWLOADER statement.
The first way to insert data is by developing SSIS packages and then inserting the data into the PDW using the package, I will not be going in deeper on this subject since I prefer working with the second way.
The second way is an exclusive feature for PDW: DWLOADER.EXE. This Command Line based loading tool will ensure the fastest loading times using the infiniband connection. There are many properties to configure DWLOADER, and in this post I will explain how changing these properties will make your load a success and affect your load speeds.
I will start off with giving you guys a sample dwloader statement and from this statement I’ll explain the different values you can configure to increase your loading speed, and I’ll explain the key properties of the DWLOADER (without these properties your load will fail!)
dwloader.exe -M fastappend -m -b 2000000 -i FILEPATH\FILE -T TABLE -R FILEPATH\FILENAME.bad -t \t -r \r\n -dt FILEPATH\dateformatDR.txt -E -S 172.16.254.5 -U BatchUser -P B@tch1234 > “FILEPATH\DWLOADLOGvx.LOG”
- Dwloader.exe : this is used to call the dwloader program
- –M Fastappend: using fastappend will increase your loadrate significantly! Other options are Append Upsert & reload.
- –b: this will define after how many rows you want to insert this load in the columnstore table. It is better to split this up using a custom batch size to avoid tupple mover(tupple mover = stores data in rowstore to dump them afterwards in columnstore) . So it might be interesting to tweak your batch size so that you don’t trigger this rowstore and directly insert into the columnstore.
- –i: The path of the file you want to load into your PDW (Best on a location which is connected to the Infiniband to avoid the bottleneck of your ethernet connection)
- –T: Is the destination table created in your PDW in which DWLOADER will insert the data.
- –R: is the destinationpath were you will keep your log file if something goes wrong with your load (Values that can’t be loaded f.e. wrong dateformat)
- –t: is the delimiter which is used in the file (in the case of the sample statement above this is /t which means it’s a tab delimited file)
- \r: will define the end of your row. This is defined by \n in the example file which is enter.
- –D: If you don’t have multiple dateformats in your file, you can define the default format for your date fields by using the –D property f.e. –D “DD-mm-YYYY”
- –dt: If you have multiple date formats in your file you will have to add the format for every datefield in a txt file in which you put the name of the field followed by the format it uses.
f.e. Birthdate= yyyy-MM-dd
Startdate= yyyy-MM-dd HH:mm:ss.fff
- –S: will define which connection will be used to transfer the data from your flat file to your PDW, be sure to check the IP’s of your Ethernet & Infiniband connection. Using the IP-address of your Infiniband will ensure much faster load speeds
- –E: if you have empty values in your Flat File you will have to use –E which will convert these empty values to NULL. Without this property your load will fail!
- –U & -P: will define your username and password for your PDW
- For additional logging you will have to use > this way you can monitor your load speed a lot better and spot problems a lot faster!
This was an example of a sample DWLOADER statement, however this statement does not guarantee the best loading speed. This is defined by a combination of your dwloader configuration / flat file configuration / Table design. In my next post I will be going in deeper on table design in a PDW explaining the Distribution & Replication theory and how this will affect your load speed.