
How to achieve Ultimate PDW load speed (DWLOADER PART 2)

In previous posts I talked about DWLoader and the distribution & replication theory. Today I'm going to explain how you can combine that knowledge to achieve the ultimate PDW load speed with DWLoader.

There are three things that determine your load speed:

  1. File size
  2. Destination Table Design
  3. Structure of the file

 

I will start off by explaining how file size affects your load speed. Say you have a fact table of about 1 000 000 000 rows: if you load that file straight into your PDW with DWLoader, you will see that your load speed does not reach the results you expect, even over InfiniBand. How did I solve this? The solution is quite straightforward: use a file-split program to split your file into the desired number of rows; I tend to use files of 100 000 000 rows each. You can achieve this with the GSplit tool or a similar one. After splitting the files, I loaded them in parallel using a batch script to max out the InfiniBand.
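As a sketch of that split-and-load-in-parallel approach (the file name, chunk size, and dwloader flags here are assumptions taken from the sample statement later in these posts; on the Windows landing zone you would use GSplit and a .bat file, and run the real dwloader.exe instead of echoing it):

```shell
#!/bin/sh
# Toy extract so the sketch is self-contained: 10 rows instead of 1 000 000 000.
seq 1 10 > FactSales.txt

# 1. Split the extract into fixed-size chunks (100 000 000 rows per chunk
#    in practice; 3 rows here so the toy file yields several chunks).
split -l 3 FactSales.txt fact_chunk_

# 2. Launch one dwloader per chunk in parallel to max out the InfiniBand.
#    'echo' stands in for the real call, which only exists on the appliance.
for f in fact_chunk_*; do
  echo dwloader.exe -M fastappend -m -b 2000000 -i "$f" -T FACT -S 172.16.254.5 &
done
wait   # block until every parallel load has finished
```

Each background dwloader gets its own chunk, so the appliance receives several streams at once instead of one long sequential load.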

Now that you have split up your file, we can move on to the second thing that affects your load speed: your table design, and more specifically your choice of distribution key. If you want more information on how to select a good distribution key, I suggest you read my previous post (https://swyssql.wordpress.com/2014/01/21/pdw-distribution-replication/). If you have followed the four questions for selecting your distribution key, your load speed will not be affected by your table design; but if you notice that your load speed is a lot slower than expected, it may be worth taking another look at the key you chose.

The last and final factor that defines your load speed is the structure of your file. PDW copes very well with unordered data, but loading a file that is ordered on your distribution key can hurt your load speed: consecutive rows all hash to the same distribution, so the appliance works on one distribution at a time instead of spreading the load. The thing to avoid, then, is choosing a distribution key on which your source file is sorted. And this brings me to the most important conclusion I reached while tuning DWLoader: sometimes it is better to load your data into your PDW using a staging table with a different distribution key than your ideal one. Afterwards you can use CTAS magic to transform that table into one with the best-performing distribution key for your queries (the ELT principle: https://swyssql.wordpress.com/2013/12/13/basic-introduction-to-ms-pdw/).
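A minimal sketch of that staging pattern (the names are hypothetical: FACT_STG is assumed to be the staging table, distributed on a key that matches the file order, and customer_id the ideal query key):

```sql
-- Redistribute the staged data onto the best-performing query key.
CREATE TABLE FACT
WITH
(
    DISTRIBUTION = HASH (customer_id),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM FACT_STG;
```

The CTAS runs entirely inside the appliance, so redistributing on the better key this way is usually much cheaper than loading the flat file on that key directly.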

If you keep these three hints in mind, you should achieve your ultimate DWLoader speed (1.3 GB/s is what we have achieved so far).

In future posts I will talk about the PolyBase feature of PDW and how to set up your very own Hadoop cluster so you can combine Big Data with your warehouse using external table commands.

Stay tuned!


PDW Distribution & Replication

Today I want to talk a bit more about distribution and replication and how they are used in a PDW. I'm going to start by answering two questions.

What is Distribution?

Because MS PDW is an MPP system (https://swyssql.wordpress.com/2013/12/13/basic-introduction-to-ms-pdw/), it uses different nodes to ensure fast querying! With the distribution theory we take our large tables and distribute them over the different nodes, thus avoiding the I/O, Ethernet, and disk bottlenecks.

In a basic DWH you will choose your fact table as your distributed table. In the case of a quarter-stack PDW this means your fact table gets distributed, using the distribution key, across your two data nodes, while the control node holds metadata about the distribution of the fact table. There are cases where you will replicate your fact table, but only when it is very small.

What is Replication?

With the replication theory we copy a certain table to all of your nodes to make your queries return results faster. In most data warehouses these will be your dimension tables, which are small compared to the fact table. Because these tables only need a tiny amount of space per node, it is better for the performance of your query JOINs to replicate them on all your nodes.

Now that we know what distribution & replication are, I'm going to continue by explaining how they affect your table design in PDW. The first thing you have to do is identify which tables should be replicated and which distributed; this analysis can be done quite quickly by checking the size of the tables in your original system.
Once you have decided which tables are distributed and which replicated, you can start on your table design.
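A quick way to do that size check, assuming your original system is SQL Server (a sketch; run it against the source database):

```sql
-- Rank tables by row count: the big ones are candidates for
-- distribution, the small ones for replication.
SELECT t.name AS table_name, SUM(p.rows) AS row_count
FROM sys.tables t
JOIN sys.partitions p
  ON p.object_id = t.object_id
 AND p.index_id IN (0, 1)   -- heap or clustered index only, no double counting
GROUP BY t.name
ORDER BY row_count DESC;
```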

Replicated table

I will start off by giving you an example CREATE TABLE statement for a replicated table.

CREATE TABLE DIM_FUNCTION
(
    [FUNCTION_ID] [int] NOT NULL,
    [FUNCTION_NAME] [varchar](30) NOT NULL,
    [FUNCTION_NAME_SHORT] [char](5) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);

The first part of the CREATE TABLE statement is exactly the same as in a normal DWH, but the WITH section is something new. To define the table as replicated you put DISTRIBUTION = REPLICATE in your table design, and PDW will automatically replicate this table on the different nodes of your PDW.
You can also choose to add the CLUSTERED COLUMNSTORE INDEX part of the statement. In my experience PDW queries perform better with a clustered columnstore index. This is because the data is stored column by column (versus row by row) and compressed efficiently. If a query only selects a few columns, f.e. SELECT LASTNAME, FIRSTNAME, you will not read the job name, the function, or any other column. You read less from disk, and that makes your queries a lot faster!

That's all there is to creating a replicated table. I will now continue with the distributed table, which is a bit more complex.

Distributed table

Again I will start by giving you an example statement for a distributed table.

CREATE TABLE [FACT]
(
    [file_id] [int] NOT NULL,
    [Function] [int] NOT NULL,
    [Place] [int] NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (file_id),
    CLUSTERED COLUMNSTORE INDEX
);

As with the replicated table, the first part of the statement is exactly the same as in a normal SQL Server, but the WITH part is something new! This time the DISTRIBUTION clause hashes a key: this is your distribution key, and it is the most important thing you will configure in your database. The right choice of key will result in lightning-fast querying!

How do you choose your Distribution Key?

When I choose my distribution key, I analyze all my columns by asking four questions.

  1. Is this column used in a lot of my queries?
    If the column is used for many joins, it may be a good candidate for the distribution key. The most expensive operation while querying your PDW is the Data Movement Shuffle (recreating a distributed table on a single node to make a join compatible), performed by the Data Movement Service (DMS); these shuffles significantly slow down your queries. Choosing a column that is used in a lot of queries avoids incompatible joins and ensures excellent query performance.
  2. Does this column have a high cardinality?
    This question really asks how to avoid skew. Skew appears when you distribute your table on the wrong column: if that column contains a lot of NULL values and few distinct values, the system cannot spread your data evenly over the different nodes. One node ends up with a lot of data while the others hold just a little, so you are effectively using only one node of your MPP system, and I don't have to explain why that is bad for query performance.
  3. Is this column compatible with my aggregates?
    If the column appears in your aggregates (your GROUP BY clauses), the aggregation can be computed locally on each node, which improves query performance.
  4. Will this column ensure a nice balance between workload and query behavior per node?
    This is what you ask yourself after the three previous questions to determine the best distribution key for your DWH. I tend to use a point system, scoring the different columns on each question and comparing them, so I can easily choose the most logical distribution key.
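To put numbers behind question 2, here is a quick check you can run per candidate column (a sketch using the FACT table and file_id column from the example above):

```sql
-- High distinct_keys and few null_rows relative to total_rows
-- suggest the column will distribute evenly (little skew).
SELECT
    COUNT_BIG(*)                                     AS total_rows,
    COUNT(DISTINCT file_id)                          AS distinct_keys,
    SUM(CASE WHEN file_id IS NULL THEN 1 ELSE 0 END) AS null_rows
FROM FACT;
```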

In future posts I will explain the relation between your table design and your load speed (DWLOADER PART 2), and how to tune your queries for maximum performance.

Stay tuned!


Load your data into your PDW (part 1 the DWLOADER)

Today I will enlighten you about the incredibly fast load times you will get with a Microsoft Parallel Data Warehouse!

The first thing you have to do when you work with an MS PDW is load your data into it. This initial load will be the most time-consuming, so with this post I hope to help you avoid the common mistakes I made myself with the DWLoader statement.

The first way to insert data is to develop SSIS packages and load the data into the PDW with those packages. I will not go deeper into this subject, since I prefer the second way.

The second way is a feature exclusive to PDW: DWLOADER.EXE. This command-line loading tool ensures the fastest load times by using the InfiniBand connection. DWLoader has many properties to configure, and in this post I will explain how changing them will make your load a success and affect your load speed.

I will start off by giving you a sample dwloader statement. From this statement I'll explain the different values you can configure to increase your loading speed, as well as the key properties of DWLoader (without these properties your load will fail!).

  •  dwloader.exe -M fastappend -m -b 2000000 -i FILEPATH\FILE -T TABLE -R FILEPATH\FILENAME.bad -t \t -r \r\n -dt FILEPATH\dateformatDR.txt -E -S 172.16.254.5 -U BatchUser -P B@tch1234 > “FILEPATH\DWLOADLOGvx.LOG”

  1. dwloader.exe: calls the dwloader program.
  2. -M fastappend: using fastappend will increase your load rate significantly! The other load modes are append, upsert, and reload. Note that fastappend requires multi-transaction mode, which is why the sample also passes -m.
  3. -b: defines the batch size, i.e. after how many rows a batch is committed to the columnstore table. Rows first land in a rowstore delta store, and the tuple mover later moves them into the columnstore; it can pay off to tweak your batch size so that rows skip the delta store and are inserted directly into the columnstore.
  4. -i: the path of the file you want to load into your PDW (best on a location connected to the InfiniBand, to avoid the bottleneck of your Ethernet connection).
  5. -T: the destination table created in your PDW into which dwloader will insert the data.
  6. -R: the destination path of the reject file, which keeps the rows that could not be loaded (f.e. values with a wrong date format).
  7. -t: the field delimiter used in the file (in the sample statement above this is \t, which means it is a tab-delimited file).
  8. -r: defines the end of your row; in the sample this is \r\n, i.e. a Windows newline.
  9. -D: if you don't have multiple date formats in your file, you can define the default format for your date fields with the -D property, f.e. -D "dd-MM-yyyy".
  10. -dt: if you have multiple date formats in your file, you add the format for every date field in a txt file, putting the name of the field followed by the format it uses,
    f.e. Birthdate=yyyy-MM-dd
    Startdate=yyyy-MM-dd HH:mm:ss.fff
  11. -S: defines which connection is used to transfer the data from your flat file to your PDW. Be sure to check the IPs of your Ethernet & InfiniBand connections: using the IP address of your InfiniBand will ensure much faster load speeds.
  12. -E: if you have empty values in your flat file, you will have to use -E to convert them to NULL. Without this property your load will fail!
  13. -U & -P: define the username and password for your PDW.
  14. > : for additional logging, redirect the output to a log file; this way you can monitor your load speed a lot better and spot problems a lot faster!

This was a sample DWLoader statement; however, the statement alone does not guarantee the best loading speed. That is defined by the combination of your dwloader configuration, your flat-file layout, and your table design. In my next post I will go deeper into table design in a PDW, explaining the distribution & replication theory and how it affects your load speed.
