Jitterbit: Text Lead To SQL Gold

Jitterbit: Text Lead To SQL Gold

Jitterbitting Firewall CSV Logs to MySQL

Jitterbit Alchemy: Turns Text 'Lead' Into SQL 'Gold'

When Aristotle said "The whole is more than the sum of its parts" could he have perhaps been referring to the melding of seemingly worthless, decentralised and unwieldy text based data, into centralised, web application friendly SQL databases? 

Seemingly worthless data is all around us these days - but could some of it just waiting to be presented in such a way, that it could very well transform our businesses?

In the mid 90s, during the visit of a telephone engineer to the IT room, I found myself curiously looking over the engineers shoulder as he configured in a few more extensions for sales on the companies ageing PBX system. To check that everything worked, the 'call screen' was brought up, and this screen outputted (via Windows hyper terminal) lines and lines of neatly structured data records. Each line being the record of a call, from a number, to an extension, the start time and the end time. It was a logging system purely for diagnosis purposes. Not just for hardware diagnostics, I thought.

I quickly hatched a cunning plan in which I could finally take my sweet revenge over the dolts in sales. 

Using what was essentially some market research survey software, I built up a 'survey' which asked all the 'questions' in order to match with the 'answers' coming from the phone system. What is your telephone extension? What is the incoming call number? What is your outgoing call number, what time did you start your call? What time did you end your call? The phone system data was then dumped from hyper terminal into a text file and fed in to the market research software. The result... our survey said, was proof of some particularly lazy sales people!!! Nasty, I know, but I was young - and it was for the good of the company.

I can still remember the look on the MDs face, when I presented him with a daily breakdown of the telesales team's phone activities & performance - showing calls by direction, local, national, international and mobile rates - by call frequency and duration grouped by sales team, then by extension. The data ended up transforming the running of the 30 strong sales department. It was extremely useful for the support department too. 

So that there was a great example of a sort of data alchemy. Worthless logging data in it's own right - line by line.  However, when transformed and presented in the right way - all of a sudden you have a priceless and essential business KPI.

In todays world, I know that - with a competent application server such as Adobe Coldfusion, one could write out some code, build some file import protocols and perform something like the above system just as well. However in my experience, getting an application server to consume huge amounts of data on a regular basis, without developing in a lot of 'source handling intelligence' can end up being a major headache. If your source and target datasets are on different ends of a LAN (or a WAN) then the system needs to be smart enough to handle connection dropouts, or just the odd network 'freakout'. 

Out of the Firewall into the SQL Server...

So when for a recent project, we needed to be able to read large amounts of data in from 'flat' text files dumped by firewall log export - we thought it would be much smarter to go out and look for a system that was purpose built for the task of sucking up data from a remote source, performing the transformations, and then moving it out to a SQL database - where we can really get down and dirty with it.

The actual task at hand, was to make better use of the fantastic amount of data which a Palo Alto Networks PA-2050 was providing in the form of its firewall logs. This firewall is exceptional in that it provides a highly granular look at traffic based on different types of network application - rather than simply what's going down port 80 for instance. Talking about web applications (such as Facebook) rather than ports - is far more useful information to customers who want to track their bandwidth use. Yes the firewalls own user interface does have some equally exceptional reporting systems, and these do present the data beautifully - however, the firewalls weakness is that it will only graph out all the data for you. You can't filter these graphs by hosts or collections of hosts. There are other report screens which can be built up as custom reports that can show you filtered application based bandwidth use, but these are SLOW if you want to see any more than a days data. Regardless, I don't want to be running reports on my firewall (nor do I want customers doing this), and I might want to add in some additional custom information. So what then?

Our first port of call was a well known 'log management system' called Splunk. Now Splunk has a bit of a reputation as a formidable system for the processing and contortion of all types of IT related log data across the enterprise. As we needed to work with logging data - we thought it was the obvious choice.

Just as advertised, the software installed painlessly enough, and we soon had it reading through our firewall logs.  Little disclaimer here - memory of this system has faded a bit, but it was a general consensus that, once we had it 'working' - we found it less than intuitive to actually use.  For instance, if you wanted to create a report from your log file data -  it would seem to take an age to create - even though Splunk had imported & indexed it already. You would then 'save' this report, only to find that when you came back for a second look - Splunk insisted on re-generating the report all over again!  On the other hand - one thing that did cheer us immensely was the active and quite enthusiastic community of users very much alive on the Splunk forums - dishing out help to people like us using the free product. People that take the free product, are far more likely to upgrade to the full product if they have had a good experience with the product, and the folks at Splunk are well aware of this. 

There were pros and cons, but I think on the whole, our requirement wasn't really a good fit.

Our data went in to Splunk with ease - but there, it seemed, it was destined to stay. At least that's what the product interface and marketing materials seemed to suggest. It is possible to pass data from Splunk server to Splunk server - but onwards from there it's not clear and thus we assume - not entirely easy.  For this project - we only wanted to leverage existing application skills in house, and build our reporting system on this - so the idea of learning a whole new Splunk lingo didn't really appeal - and for our requirements - not necessary. We just needed to ship some data around - not present it in a glossy dashboard across an enterprise. The killer blow for the free version of Splunk was the 500MB daily limit for log importing. Because log file sizes change all the time - we found this to be a seriously off-putting limitation. What if we had a particularly busy 'log day'? Splunk would hit the limit and simply down tools. We couldn't rely on that. Of course, for a free system - how can you complain? So what about buying it? Unfortunately -  for some reason Splunk take you from a 500MB limit on the free version to the unlimited  'Enterprise' version with no price break in between. With the Enterprise version starting at $5,000 - for our requirements anyway, I'm afraid - Splunk flunked.

So we came across Jitterbit - the free community server version. Here was a system that seemed a much sweeter match for our requirements. This software was designed not specifically as an IT data log centralisation system - but more generally - a data transformation system. Pretty much, any data, to any data.

Starting up with Jitterbit was not at all as smooth as the Splunk system, living up to it's name by giving us a minor case of the installation 'jitters'.  On three separate Windows 2003 servers - all installations failed for one reason or another. The system installs it's own version of Tomcat, Apache and also the PostgreSQL database system - naturally for a working system, all three components need to work. We ended up installing it on a spare Linux server - which finally spluttered into life after tweaking some environment paths. 

One you have the server component running - next you install the client software on your desktop - unlike Splunk which is an all in one server product in which you 'log in' to the server via a browser. Installation of this was much more streight forward, and only a bit of firewall tweaking was needed to connect our client to the Jitterbit server. The client interface for Jitterbit does take a little getting used to - but the online help is good. Their online community help is we're sorry to say - pretty darn awful, so don't expect much (or any) help at all. We had a couple of newbie type questions at the start of our evaluation, none of which seemed to deserve and answer - even after desperate 'Hello - is anyone there!!?' type follow ups.  If it's the policy to not provide any support for people that have downloaded the free version - then, please - Jitterfolks, just say it. At least charge people to access the forums. Don't simply ignore us users! 

Regardless of the non-existant community support - with a bit of good old trial and error together with the useful in context online help - we got there in the end. 

Jitterbit is a very logical system. You create a project, define your data sources (inputs), and then your data targets (outputs). You can go from CSV to CSV, LDAP to SQL, SQL to XML, Web Service to CSV. Whatever you need. The most popular use for the product seems to be the syncing of cloud based data from SaaS CRM solutions like Salesforce.com to and from back office databases. We can see why Jitterbit would be an excellent choice for this. 

In our particular case - we were taking our firewall logs, from their native text based CSV - into SQL, a MySQL database specifically. Each log file was something like 400MB in size, so the first thing to do was to load up a sample, same structure, just fewer rows, to make it easier / quicker to work with.  This sample file was loaded in, in order to act as a template for a 'text structure' - where you are essentially defining the structure of the text file you are importing. You can create multiple structures, and if you were going to take data from CSV to another type of text format, you could create one structure for the input, and another for the output. Likewise for CSV to XML, you would write out an XML target schema.

The actual source (where you tell Jitterbit to get your data from) could be a network share, SFTP / FTP for text files - HTTP for XML, Database connections or LDAP. In the same way, you set up your target, based on where you want to churn your data out to. After setting up the target - the next task was to create the 'transformation' - which defines what you do with the data as it passes  from source to target. You select the input text structure which we created for the input data, and then you configure the output - so here we're setting up the MySQL database target we configured previously. Next is the really neat part. An intuitive GUI field mapping tool which enables you to drag and drop fields from your input text structure, to your output database. The only thing to watch out for is to make sure that data passing over to your target is doing so in the correct format for the DB field type. So for instance, passing a date/time data over in the wrong format will give you an error - so you can get Jitterbit to tinker with the date as part of the transformation in order to fix this for you. 

What A Transformation: Drag and drop text to SQL

For this project, the firewall text log had hundreds of columns, but we only wanted to move five to our database. Using the field mapper - we mapped our five over, and that was that. You can then test out your transformation with some local test data to ensure that it all works.

The last link in the project is to define the 'Operation'. This is the place where it all comes together. You set the source, transformation, and your target - and any pre and post transformation scripts. From here you can also set your schedules, so that the Operation is run on a timetable to suit. Of course a single project can have multiple sources, targets, transformation and operations - meaning that potentially, it would be possible to hook Jitterbit up to gather and sync data from a multitude of different sources, into one central target, or perhaps multiple different targets!

Once up and running, we found Jitterbit extremely robust at handling source data, and we deliberately disconnected our test system while it was mid 400MB log file download - we were pleased to note that it was able to recover without issue each time. Collecting the source data over FTP/STFP is straight forward. There is an option to delete the log file after download so that you don't end up with a maxed out log file hard disk. Alternatively, you can rename the file so that it no longer fits the definition of the file type that is read as input data, and as such, not re-imported, but identifiable as have been processed. Neat feature.

After some challenges getting to grips with the free Jitterbit Community Edition, we now have it effortlessly importing logs from the Palo Alto PA-2050 over to our SQL database - on average, 500MB of data, resulting in around 1,000,000 records a day - currently without fault. All of this, on a free system - with no data size import restrictions! Another crucial advantage to using the Jitterbit system is that - thanks to the GUI tools - it's easily possible for someone without developer skills to be able to specify, create and perform the entire data transformation process - which for us meant that developers could focus their valuable time working on the front end 'client visible' application - rather than wasting huge amounts of time on building a highly boring data import system.

For commercial users, we would recommend taking up Jittberbits paid support, as their community support seems so lacking. Of course we would also recommend checking out the differences between the paid and free versions of the product to ensure that your requirement is covered in the functionality.

For us - Jitterbit Server Community edition has not only converted our data - it's converted us too!

Post a Comment
  1. Leave this field empty

Required Field