from IPython.display import Image
Image(filename='Anaconda3\\output\\frontispiece.JPG')
Chesler Park, Needles District in Canyonlands National Park, Utah
For years I used SAS® software in my professional career. I have always been impressed with its flexibility, its ability to manage large data sets, and its ability to take input from just about any source. These characteristics help to explain its ubiquity in businesses throughout the world. In the early days of data analysis, business users had few choices of tools for ad hoc analysis.
One of the tools available then was Base SAS® software. For clarity, we will refer to SAS and Base SAS as the language as opposed to the company, SAS Institute Inc. Back then there was no concept of self-serve so users were mostly left to queue their requests for analysis and reports to a central IT group.
Eventually, a few intrepid users discovered that SAS software was used for mainframe capacity planning. Mainframes were the dominant systems back then, and their costs were large enough to warrant the practice of analyzing utilization. SAS was the primary software used for this activity. CEOs needed to know when they were obliged to deliver a substantial capital expenditure check to IBM.
By taking matters into their own hands, users learned the mechanics of submitting SAS batch jobs (no interactive processing back then) and soon discovered they too could access data, munge it, and produce the sorts of reports and analysis that had meaningful impacts. As the number of computing platforms expanded throughout the 1980s and 1990s, SAS became available for them as well. All of this led to a substantial number of SAS users.
Another important aspect of SAS’ enormous success is the formal gathering of requirements for new features and capabilities through its <a href="http://support.sas.com/community/ballot">SASware Ballot®</a>. It is a real challenge to put into practice what sounds like such a simple idea.
All of this sounds quaint in light of today’s ability to visit a web page and, by simply clicking a few buttons, spin up a cluster of hundreds or even thousands of machines with an enormous number of proprietary and open-source software components. All of this is available in a matter of minutes with just a credit card.
A lot has changed since then. And that’s my motivation to write these examples, in the spirit of learning additional ways to analyze data.
A trained historian might call the foregoing an insurgency, and indeed, that is what it was. The rebels were the business users who sought to overthrow the stifling order of the day. They looked for workarounds to gain access to useful data locked down by rules and rulers, and SAS was their agency.
Sometimes the rulers might grumble about 'referential integrity' to end, on 'technical grounds', the conversation about giving business users access to data. And the business user had to have lunch with an IT colleague just to find out what that was all about. The google verb was not in the parlance of the day.
In effect, SAS is the original insurgent in the data openness movement. But once again, history can provide clues about what might happen next. As with most insurgencies, the rebels eventually become the establishment, and before long a new generation of rebels seeks to overthrow the existing order of the day.
SAS Institute just released SAS Viya™, its modern high-performance, in-memory platform. SAS Viya is described as being open, both to a range of modern and traditional analytical methods and to all user types. Alongside new interfaces for interactive and visual audiences, a notable feature is the provisioning of REST APIs for application developer services, with interfaces for Python, Java, Lua (and in 2017, R), as well as accessibility for traditional SAS programs.
SAS continues its historic pattern of investing in software products that are adaptable for the current times. It also recognizes the expanding range of skills found among today's data scientists, business users, and students that include Python and others.
The audience for these examples is the traditional Base SAS user or programmer who wishes to expand their repertoire of skills. It’s also for application developers and architects who don’t know SAS but want to call SAS methods directly from their applications or by simply coding in the interface of their choice.
Learning a new skill is exciting, and it sometimes has its challenges. I made an effort to organize the chapters and create the SAS analog programs for a compare-and-contrast approach. Some of the challenges I encountered, I intentionally left as errors. Sometimes, mistakes make a good learning device.
An insight I gained while working with the International Institute of Analytics was that investments in analytical know-how were significant and, at the same time, did not seem to pay the anticipated returns. IIA uses a benchmark to measure a firm's analytical maturity with a quantitative and qualitative assessment process described here.
Executive management, whose focus is on strategic issues, tended to see the enormous growth in data collection rates (and, of course, costs) as negatively correlated with insights delivered. But they could also see how the open source insurgency inside their own organizations was a positive force to alter the existing order.
Hearing these comments and thinking back to my experiences while working for SAS Institute, I recalled how much energy is spent on low-return activities like copying and pasting data, forcing Excel to conform to a proper format, maintaining fragile programs that run purely to do format translations, and so on. The number of permutations for data interchange is quite large.
To give a feel for the scope of this challenge, I created a hypothetical crossing of department by server by file formats easily encountered in a large enterprise today.
from IPython.display import Image
Image(filename='Anaconda3\\output\\product_format_os.JPG')
One might quibble with a couple of non-existent crossings being in the table. The number of cells is 504. But consider all of the formats not included, and the fact that most large enterprises have scores of departments. By any measure, the scope of this issue is quite large.
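How quickly such a crossing grows can be sketched in a few lines of Python. This is only an illustration: the department, server, and file format labels below are hypothetical and are not the rows and columns of the table above, and the helper name `count_crossings` is made up for this example.

```python
from itertools import product

# Hypothetical labels, chosen only for illustration; they are not
# taken from the table shown above.
departments = ["Finance", "Marketing"]
servers = ["Linux", "Windows", "Mainframe"]
file_formats = ["csv", "xlsx", "sas7bdat", "json"]


def count_crossings(*dimensions):
    """Return every combination of the given dimensions and how many there are."""
    crossings = list(product(*dimensions))
    return crossings, len(crossings)


crossings, n_cells = count_crossings(departments, servers, file_formats)
print(n_cells)       # 2 * 3 * 4 = 24
print(crossings[0])  # the first department/server/format combination
```

Because the cell count is the product of the lengths of each dimension, adding one more department, server, or file format multiplies the work rather than adding to it.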
Everyone simply assumes this is just a part of the data munging process one goes through. Yet, it does not have to be that way. If data assets were completely fungible then the return rates on insights could grow significantly. In the open world of data science, data assets should be interchanged more easily.
To give an example of the current insurgency I mention above, the open-source community is already working to address a part of the data interchange issue. The Apache Feather project, described here, is working toward "...improving interoperability between Python, R, and external compute and storage systems". This effort is very compelling. Imagine switching between a Python and an R interpreter and having the DataFrame rendered appropriately in the current context.
Imagine further if the original insurgent, the database vendors, and the NoSQL crowd were to contribute resources to this effort. It would mean an enormous reduction in the unproductive effort users go through every day, and it would significantly accelerate the return rate for insights gained from the data collection efforts.
I would encourage the SAS user community to use the SASware Ballot or any other means they have to encourage SAS’ participation.
SAS is certainly a long-standing leader and a significant part of the data science community. If one simply considers the product of the number of years SAS has been in use and the quantity of data it holds in .sas7bdat files alone, then SAS software has one of the world's largest collections of data assets already organized for analysis and modeling.
A claim to openness is only as good as the actions used to benefit the entire community and not just a portion.
If you have feedback, which is always appreciated, or have ideas for additions, improvements, or amendments, you can contact me at tr.betancourt at comcast dot net.