DATABERG
“About Data Traffic”: an insightful interview with a CTO expert
“Data traffic” is a technical term that is increasingly used in business. Many companies talk about it, especially those that have adopted a data-driven culture.
Nonetheless, many individuals and organizations across industries find that the media explains the term “data traffic” only vaguely. As a result, they have a rough idea of what it represents, but not of what it actually is.
So we asked Petri Launiainen: technology expert, author, former VP of Nokia, and currently CTO of Datumize, to help on this matter and answer some basic, but important questions regarding data traffic.
In this interview, the link between business, technology, and data becomes clear. It is exactly this link that makes it possible to carry knowledge from data to business and vice versa.
1. What is data traffic by definition?
All information transfer is migrating into digital form these days, so it is fair to say that “data traffic” is pretty much any traffic.
Even if we send voice over IP, it is turned into bits and travels along the same data paths as anything else, and hence looks like just more data traffic.
There may be network optimizations and prioritizations related to the type of data, but it is still all ones and zeroes.
2. Is data traffic the same as network traffic?
Both “data” and “network” are broad terms. As explained above, anything in digital form is now “data”, and a network, whether it is wired or uses radio waves, exists only to move those bits around.
So it can be generalized that data traffic is the same as network traffic.
3. What are the primary sources of data traffic that companies use to extract data?
You can look at this from different perspectives. If you only look at the amount of data, networks can be coarsely divided into two categories: the Internet and corporate internal networks.
Video services account for over 50% of all Internet traffic, followed by website traffic, gaming and social networks.
As for corporate internal networks, the traffic ranges from data produced by industrial appliances and sensors to simple sales-related database queries and updates. Lately, a growing share is data moving between organizations and their cloud-based services.
4. Is there a key difference in those sources, in terms of technical ability for the extraction of data?
In general, there are two ways to extract data: either by actively polling or by sniffing the existing traffic.
The fundamental difference is that polling requires separate request-response pairs for any data extraction, and thus adds to the volume of data on the network. Sniffing, on the other hand, works on an electronic, non-intrusive copy of the existing network traffic, from which the actual underlying request-response pairs are then extracted.
Sniffing does not add to the existing data volume, as the transparent replication of all traffic can be done in a switch on the network. It does require intelligence in deciphering the underlying traffic, however, and in the case of encrypted traffic, it usually can’t be done at all.
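The difference in network load can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the `Link` class, its `mirror` list (standing in for a mirroring switch port), and the sample order messages are hypothetical and not part of any real product or library.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """Hypothetical network link that counts every byte it carries."""
    bytes_carried: int = 0
    # Copy produced by a mirroring switch port (assumption for this sketch).
    mirror: list = field(default_factory=list)

    def send(self, frame: bytes) -> None:
        self.bytes_carried += len(frame)
        self.mirror.append(frame)  # replication costs nothing on the link itself


def poll(link: Link, request: bytes, response: bytes) -> bytes:
    """Polling: an extra request-response pair travels over the link."""
    link.send(request)
    link.send(response)
    return response


def sniff(link: Link) -> list:
    """Sniffing: read the mirrored copy; nothing new is sent over the link."""
    return list(link.mirror)


link = Link()
# Existing business traffic: one request and its response.
link.send(b"GET /orders/42")
link.send(b'{"order": 42, "status": "shipped"}')
baseline = link.bytes_carried

observed = sniff(link)
after_sniff = link.bytes_carried   # same as baseline

poll(link, b"GET /orders/42", b'{"order": 42, "status": "shipped"}')
after_poll = link.bytes_carried    # baseline plus the size of the extra pair
```

In this toy model, sniffing sees both existing frames while the byte counter stays put, whereas polling for the same information doubles the traffic.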
5. Are the different sources a prerequisite for data silos?
Data silos are proprietary, often closed systems, frequently based on older or semi-obsolete data processing facilities, which were not originally designed to share their internal data with other systems.
Extracting data from such systems usually requires either an adapter that can tap into the silo contents and make them available to other systems, or an active “push” to the outside world from within the silo. Both tend to have the side effect of increasing the amount of traffic inside the tapped system.
6. Is the format of the flowing data uniform?
Data is sent around by utilizing network protocols, the most common of which nowadays are TCP and UDP. This sets up the framework for the network elements to handle and route data packets correctly.
What is inside these packets is entirely flexible and depends on the application. A myriad of internal data representations have been defined, but in most cases the sender and receiver are already aware of the internal structure.
But if you sniff this traffic, you have to test the detected byte streams against various templates in order to figure out what is actually going on. It is doable but requires active, intelligent data gathering.
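Testing byte streams against templates could look roughly like the sketch below. The three templates (an HTTP/1.x request line, the first bytes of a TLS handshake record, and a JSON document) and the function names are illustrative assumptions; a real classifier would need far more templates and would have to handle payloads split across packets.

```python
import json
import re

# Template 1: an HTTP/1.x request line at the start of the payload.
HTTP_REQUEST = re.compile(
    rb"^(GET|POST|PUT|DELETE|HEAD|OPTIONS|PATCH) \S+ HTTP/1\.[01]\r\n"
)


def looks_like_http(payload: bytes) -> bool:
    return HTTP_REQUEST.match(payload) is not None


def looks_like_tls(payload: bytes) -> bool:
    # Template 2: TLS record with content type 0x16 (handshake), major version 0x03.
    return len(payload) >= 3 and payload[0] == 0x16 and payload[1] == 0x03


def looks_like_json(payload: bytes) -> bool:
    # Template 3: a UTF-8 payload that parses as a JSON object or array.
    try:
        text = payload.decode("utf-8").strip()
        if not text.startswith(("{", "[")):
            return False
        json.loads(text)
        return True
    except ValueError:  # covers both decoding and JSON parsing errors
        return False


# Order matters: the first matching template wins.
TEMPLATES = [
    ("http-request", looks_like_http),
    ("tls-handshake", looks_like_tls),
    ("json", looks_like_json),
]


def classify(payload: bytes) -> str:
    """Return the name of the first matching template, or "unknown"."""
    for name, matches in TEMPLATES:
        if matches(payload):
            return name
    return "unknown"
```

For example, `classify(b'{"temp": 21.5}')` returns `"json"`, while a payload that fits none of the templates is reported as `"unknown"`.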
7. And when it is being extracted, does the format depend on the sources (is there any dependency)?
The format depends on the application. As explained above, the sender and receiver have usually already agreed on what to expect, and in the case of sniffing, the organization can indicate what kind of data to expect.
8. Is network traffic data directly useful for analytics, or should it be processed beforehand?
On the highest level, the only meaningful analytics you can extract from tapping into a data stream is the volume of data per given time.
For classification purposes you have to look into the data stream and figure out what the internal structure is, and only then can you do more in-depth analytics. On this level, you can usually see the basic type of the traffic and the addresses of the communicating parties.
At the very lowest level, you can do deep packet inspection, which in some cases lets you see the intricate details of what is being transmitted: you don’t only recognize the traffic as “web traffic”, but also get the website that was queried, the contents of the query itself, and the response sent to the recipient.
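The three levels can be sketched on a single hand-built packet. The example below assumes unencrypted HTTP over TCP over IPv4 with no IP options of interest; the addresses, ports, and host name are made up, and the checksums are left at zero because this sample never touches a real network.

```python
import socket
import struct


def parse_ipv4_tcp(packet: bytes) -> dict:
    """Extract size, protocol, addresses and (for TCP) ports and payload."""
    (ver_ihl, _tos, total_len, _ident, _frag, _ttl,
     proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    ip_header_len = (ver_ihl & 0x0F) * 4
    info = {
        "bytes": total_len,             # highest level: volume only
        "protocol": proto,              # classification level: basic type...
        "src": socket.inet_ntoa(src),   # ...and communicating parties
        "dst": socket.inet_ntoa(dst),
    }
    if proto != 6:                      # 6 = TCP
        return info
    tcp = packet[ip_header_len:]
    src_port, dst_port = struct.unpack("!HH", tcp[:4])
    data_offset = (tcp[12] >> 4) * 4    # TCP header length in bytes
    info.update(src_port=src_port, dst_port=dst_port, payload=tcp[data_offset:])
    return info


def http_details(payload: bytes):
    """Lowest level: look inside the payload (deep packet inspection)."""
    lines = payload.split(b"\r\n")
    request_line = lines[0].decode("ascii", "replace")
    host = next((line[5:].strip().decode("ascii", "replace")
                 for line in lines[1:] if line.lower().startswith(b"host:")), None)
    return request_line, host


# Hand-built sample packet (hypothetical addresses, zeroed checksums).
payload = b"GET /prices HTTP/1.1\r\nHost: shop.example\r\n\r\n"
tcp_header = struct.pack("!HHIIBBHHH", 50000, 80, 0, 0, 5 << 4, 0x18, 65535, 0, 0)
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload), 0, 0, 64, 6, 0,
                        socket.inet_aton("10.0.0.5"), socket.inet_aton("10.0.0.9"))

info = parse_ipv4_tcp(ip_header + tcp_header + payload)
details = http_details(info["payload"])
```

Here `info["bytes"]` is all that volume-based analytics would use, the addresses and ports support classification, and `details` holds what deep packet inspection adds: the request line and the queried host.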
9. Usually what percent of the whole data traffic is extracted and used by organizations?
Studies show that only between 5% and 15% of data is either used or at least stored.
10. On average, what percentage of this flowing data is dark data?
Studies show that depending on the organization, 85% to 99% is dark data.
11. What tools are used for the extraction of trapped data?
With “trapped data” meaning “dark data”, either polling or sniffing, as described above, can be used.
12. What business or individual activities stimulate the increasing data traffic?
For individual users, the use of video streaming services is the biggest source of growth. For businesses, the switch from on-premises systems to cloud services is increasing the network traffic.
13. Is there any tendency for the growth of the data traffic volume?
The figures I’ve seen indicate that the traffic volume is growing at a steady 15-20% per year. It shows no signs of slowing down, especially as new high definition video services are popping up all over the Internet. The growth has been very close to exponential over the last 10-20 years.
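A steady percentage growth per year is what makes the curve exponential. As a quick back-of-the-envelope check, the sketch below computes the doubling time and the ten-year growth factor, assuming the rate stays fixed at either end of the quoted 15-20% range; the function names are chosen for this illustration only.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years until volume doubles at a fixed annual growth rate (0.15 = 15%)."""
    return math.log(2) / math.log(1 + annual_growth)


def growth_factor(annual_growth: float, years: int) -> float:
    """Total multiplication of volume after compounding for the given years."""
    return (1 + annual_growth) ** years


for rate in (0.15, 0.20):
    print(f"{rate:.0%}: doubles in {doubling_time_years(rate):.1f} years, "
          f"x{growth_factor(rate, 10):.1f} in 10 years")
```

Under these assumptions, traffic doubles roughly every four to five years and grows about four- to sixfold in a decade.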
14. What are the major challenges that you face as CTO when it comes to data traffic?
In helping organizations tap unobtrusively into their data, the major challenge is the use of encryption: sniffed encrypted data can’t be decoded, so for such use cases, polling with the appropriate encryption keys is the only viable approach for getting at the encrypted traffic.