Archive for the ‘Data Science’ Category

Security

Posted by: Grant Stanley on March 21st, 2012

Security is very important to CAN since we work with our clients’ most sensitive information and provide them insights that are essential to the future of their organizations.  Our clients trust us with their most valuable information including business plans, intellectual property, financial and customer data.  We work daily to respect that trust.  The following is an introduction to how CAN maintains the security of all of our systems, sensitive data, and Contemporary Analysis.

1. Sensitive Data
The first step to protecting CAN’s sensitive data is to limit unnecessary data. We require CAN’s clients to scrub any sensitive data that is not needed to build models. For example, we can often build robust models without access to names or addresses. A CAN Navigator and Data Scientist can help you determine what data is necessary and how to scrub any unnecessary sensitive data.

The second step to protecting CAN’s sensitive data is classifying data by type, security level, and access permissions. All sensitive information is labeled by client, project, and security level. CAN employees are provided only the data required to fulfill their job description. CAN classifies our data into three major categories, each with a default type, security level and access permissions.
1. Public data is not sensitive and is accessible to everyone at CAN. Public data is information that is available on the Internet and is widely available to people outside of CAN and CAN’s clients.
2. Internal data is sensitive and is accessible only to executives at CAN, and to others as needed. Internal data is data that is used to operate CAN’s business.
3. CAN Client data is sensitive and is only accessible on a project basis to the data scientists, sales executives and navigators that are working on that project. Access to the data is removed as soon as the project is completed. CAN client data is any information that we receive from a client, and includes temporary data files that CAN uses to generate deliverables as well as the deliverables themselves.
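
The three categories and their default access rules can be sketched in code. This is a minimal illustration in Python, not CAN's actual system. The names DataCategory, Employee and can_access are hypothetical, and "as needed" access to internal data is modeled as an explicit grant, which is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class DataCategory(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CLIENT = "client"


@dataclass
class Employee:
    name: str
    is_executive: bool = False
    # Projects the employee is currently working on (hypothetical field).
    active_projects: set = field(default_factory=set)
    # Assumption: "as needed" access to internal data is an explicit grant.
    internal_access_granted: bool = False


def can_access(employee, category, project=None):
    """Return True if the employee may read data in the given category."""
    if category is DataCategory.PUBLIC:
        return True
    if category is DataCategory.INTERNAL:
        return employee.is_executive or employee.internal_access_granted
    # Client data: only while the employee is working on that project.
    return project is not None and project in employee.active_projects


analyst = Employee("analyst", active_projects={"acme-forecast"})
print(can_access(analyst, DataCategory.CLIENT, "acme-forecast"))  # True
analyst.active_projects.discard("acme-forecast")  # project completed
print(can_access(analyst, DataCategory.CLIENT, "acme-forecast"))  # False
```

Removing the project from the employee's active set is what models "permission is removed as soon as the project is completed".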

2. Data Management

CAN stores all data in a central location and also carefully manages the devices and people that have access to specific types of information. All of our data is stored in a secure and encrypted hosted environment. Our IT infrastructure is designed so that if a device goes missing or is compromised, CAN can identify the location of that device, terminate its access to CAN’s network and data files, and remove the encryption key to the hard drive.

CAN has also increased the simplicity and security of our data management policy by not permitting the use of USB-powered drives and other external hard drives. With the state of CAN’s network and technology, USB-powered drives and external drives are unnecessary and a major threat to the fidelity of CAN’s network and data management. When possible, CAN uses SFTP (SSH File Transfer Protocol) to transfer data within and outside CAN’s network.

All data transferred between CAN and a client must travel on an encrypted USB drive or over secure FTP. CAN requires that only new encrypted USB drives are used, and that the drive is either shredded after the transfer or stored in a locked container at CAN HQ.

3. Encryption
We use encryption on all disks and devices to add an additional level of security. Even if someone were able to get hold of a CAN device or break into CAN’s network, without access to the encryption key it is impossible to use the data stored on a device.

CAN uses disk encryption on laptops, desktops, mobile devices, and servers. CAN also encrypts data in motion between laptops, desktops, mobile devices, and servers. In addition to device level encryption, each client’s data is stored on encrypted virtual drives. This keeps each client’s data separate, and keys are only provided to the data scientists, sales executives and navigators that are responsible for that specific project. Only data scientists are provided the encryption key to the client’s raw data.

In the event of a security breach, CAN is able to revoke the disk encryption key of all of CAN’s laptops, desktops, mobile devices and servers as soon as the device is connected to the Internet. In addition, the encryption key is automatically removed and the disk erased after 10 failed attempts to access a device. Once the disk encryption key is removed, the data is unreadable.

4. Authentication
It is important that the right people are accessing the right information. CAN uses password, software, and physical access control to protect against unauthorized access. We require that every device used by CAN employees or contractors to perform their job responsibilities require a password to access, lock out for 10 minutes after 3 unsuccessful log-in attempts, and, after 10 unsuccessful log-in attempts, remove the encryption key and start erasing the disk. Users are required to change their password every 90 days, and will be prompted automatically. Passwords on phones and tablets will have 4 digits, and passwords on other devices will require a minimum of 12 characters and at least 2 different character classes.
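
The password and lockout rules above are simple enough to express as a few checks. The Python sketch below is an illustration under stated assumptions, not CAN's actual tooling: the function names are hypothetical, the four character classes (lowercase, uppercase, digits, punctuation) are an assumption since the policy does not define them, and any failure count from 3 through 9 is assumed to trigger the 10-minute lockout.

```python
import string

MIN_LENGTH = 12        # minimum characters on laptops, desktops and servers
MIN_CLASSES = 2        # at least 2 different character classes
PIN_LENGTH = 4         # phones and tablets use a 4-digit passcode
LOCKOUT_ATTEMPTS = 3   # lock out for 10 minutes after 3 failed log-ins
WIPE_ATTEMPTS = 10     # remove the key and erase the disk after 10


def character_classes(password):
    """Count how many of the assumed character classes appear."""
    checks = [
        any(c in string.ascii_lowercase for c in password),
        any(c in string.ascii_uppercase for c in password),
        any(c in string.digits for c in password),
        any(c in string.punctuation for c in password),
    ]
    return sum(checks)


def is_valid_password(password):
    """Check length and character-class rules for non-mobile devices."""
    return (len(password) >= MIN_LENGTH
            and character_classes(password) >= MIN_CLASSES)


def is_valid_pin(pin):
    """Check the 4-digit rule for phones and tablets."""
    return len(pin) == PIN_LENGTH and all(c in string.digits for c in pin)


def action_after_failures(failed_attempts):
    """Decide what a device does after a number of failed log-ins."""
    if failed_attempts >= WIPE_ATTEMPTS:
        return "wipe"
    if failed_attempts >= LOCKOUT_ATTEMPTS:
        return "lock_10_minutes"
    return "allow_retry"


print(is_valid_password("correct-horse-battery"))  # True
print(action_after_failures(3))                    # lock_10_minutes
```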

We control physical access to CAN’s facilities. When our facilities are not occupied by CAN employees and contractors, an alarm system is used. We are also investing in more advanced access control. In the future, employees and contractors will use RFID badges to enter CAN HQ, individual floors, and the server room. Each door will also be monitored by a camera. Each time a door is opened, the badge ID and a brief video clip will be recorded. These records will be reviewed once a month, and as needed.

5. Data and Disk Destruction
Disks and drives are stored securely at CAN HQ until properly shredded or destroyed.  All employees and contractors are provided two trashcans, one for paper, disks and drives, and one for other materials. Trash is removed daily, and either stored or properly disposed of.

6. Training
CAN’s employees and contractors are required to help CAN maintain effective security.  Employees and contractors receive security training when they start at CAN, and are required to participate in training each year.  All employees and contractors are required to report any suspected or real security threat or breach.

7. Visitors and Guests
CAN has a lot of people that visit our offices. All guests sign in and sign out at the front desk of Suite 200. They use their driver’s license or photo ID to sign in. All guests are provided with a visitor badge. They are met at the front desk and their host escorts them until they leave the office. CAN also maintains a separate WiFi network outside of CAN’s firewall for guests and employees that bring their own devices to work. Occasionally, depending on the nature of a guest’s visit, they are asked to leave their devices and bags at the front desk in Suite 200.

8. External Devices
CAN’s employees and most of CAN’s vendors, clients and contractors enjoy technology and are constantly investing in the latest and greatest consumer technology. CAN allows our employees, clients and contractors to bring external devices into our facilities.  However, external devices are not allowed behind CAN’s firewall, and are required to follow CAN’s security policies, including monitoring and management by CAN IT and security staff.

9. Network Security
The fidelity of CAN’s network is essential to protecting ourselves, our sensitive information, and our clients’ and partners’ networks. We record and monitor all devices that connect to our network and their activity. Logs are reviewed monthly and as often as necessary. We also require that devices, including mobile, tablet, laptop, desktop and servers, use software to identify and protect against malware and spyware attacks.

10. Disaster Recovery
CAN’s IT and Security infrastructure allows CAN to respond quickly to national, local, company, and individual disasters. We maintain copies of all key systems in multiple locations, including static backups at CAN’s facilities. All key systems, files, and applications are hosted and managed in a professionally managed environment. Using hosted solutions allows CAN to leverage state-of-the-art providers with investments in hardware, facilities, fire protection and redundant backups. Also, in the event of a disaster, using hosted solutions allows CAN’s workforce to quickly relocate to a new physical office or a virtual office environment. All that we would need is power and an Internet connection.

When designing CAN’s security policies, we wanted to make them as simple as possible so that they were easy to remember, follow and enforce. There are more complex and sophisticated security systems, but simple systems get implemented, and only a security system that gets implemented keeps anything safe. CAN and our clients are confident that these security policies will protect CAN, our clients, and CAN’s sensitive data. We are continually refining and improving our security policies.

Please feel free to ask any questions that you might have.

How to use FRED’s Add-in to Quickly get Economic Data

Posted by: Jefferson on January 17th, 2012

Our data scientists download hundreds of datasets from the phenomenal database of economic research known as FRED.  FRED is maintained by the Federal Reserve Bank of St. Louis and provides open access to economic indicators, employment variables and business trends.  It’s also really useful for getting awesome external variables to use in all your econometric modeling projects.

While FRED has an online interface for extracting, viewing and downloading data, this requires the extra steps of downloading data, importing it into your working files, manually adjusting the time scale, and so on.  If you need to quickly grab, adjust and use economic time series, the FRED add-in is an amazing timesaver, and everyone can download the add-in and learn how to install it on Windows or Mac.

In this brief tutorial, you’ll learn how to use the FRED add-in to download specific economic time series, adjust the frequency of aggregation and time range, and build a simple graph.

To start, let’s get two time series regarding foreign direct investment and domestic unemployment within the United States.  First, let’s go to the “Data Search” function of the FRED add-in and look for Foreign Direct Investment, then add the series ID.  We’ll also find and add unemployment rate to our sheet.  As long as the A1 cell is selected, the FRED add-in automatically spaces the time series.

Selecting our FRED data series

Okay, now we have the series IDs for FDI and unemployment in row 1. Row 2 contains the type of data manipulation, row 3 shows the frequency of aggregation and row 4 shows the first date of the series.  Let’s make sure the A1 cell is selected and click “Get FRED Data” to populate the time series we selected.

FRED series, unadjusted for aggregate and start date

The data has been populated, but now you’ll notice that the frequency of aggregation and start date do not match.  Before continuing, we’ll want to standardize these values, and the plug-in makes it very easy to adjust both frequency of aggregation and start date.

Because we cannot disaggregate data, we’ll have to change the UNRATE frequency to quarterly. Click cell C3 and use the “Frequency Aggregation” option to choose quarterly. We’ll also need to set UNRATE to the same start date as BOPIPD, and we can do this manually by just changing the value in the cell to “1/1/1960” or any other start date you choose.
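
To see what the frequency change does to the numbers, here is a small Python sketch that collapses a monthly series into quarterly values by averaging the three months in each quarter. It is an illustration only: the monthly values are made up rather than real UNRATE observations, the function name to_quarterly is hypothetical, and averaging is an assumption about the aggregation method, so check the add-in's settings for what it actually applies.

```python
from collections import defaultdict
from datetime import date


def to_quarterly(monthly):
    """Average monthly (date, value) observations within each quarter.

    Returns a dict mapping the first day of each quarter to the mean
    of that quarter's monthly values, in chronological order.
    """
    buckets = defaultdict(list)
    for day, value in monthly:
        # Months 1-3 map to 1, 4-6 to 4, 7-9 to 7, 10-12 to 10.
        quarter_start_month = 3 * ((day.month - 1) // 3) + 1
        buckets[date(day.year, quarter_start_month, 1)].append(value)
    return {start: sum(vals) / len(vals)
            for start, vals in sorted(buckets.items())}


# Made-up monthly values, for illustration only.
monthly = [
    (date(1960, 1, 1), 5.0),
    (date(1960, 2, 1), 5.3),
    (date(1960, 3, 1), 5.6),
    (date(1960, 4, 1), 5.1),
    (date(1960, 5, 1), 5.2),
    (date(1960, 6, 1), 5.6),
]

for start, value in to_quarterly(monthly).items():
    print(start.isoformat(), round(value, 2))
```

Going the other way, from quarterly to monthly, would require inventing values for the months in between, which is why the quarterly series sets the common frequency.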

Updated FRED data

Click “Update Data” and the series refreshes with the FRED database.  At this point, CAN typically exports the data for use in SPSS or Gretl. For those using Excel, these time series can be referenced by other sheets, and the FRED add-in is a great way to refresh your spreadsheets and analyses as new data is released.

The FRED add-in also has some built-in tools for quickly graphing datasets. Let’s try it out by going to “Build Graph” and selecting “Create Multiple Series Graph”. Select the series IDs you would like to view, click secondary axis, and then “Build Graph”.

FRED time series, graphed

Whether you use it for business or pleasure, the FRED add-in is a great way to rapidly download, update and view economic time series data from the Federal Reserve.

Also, the next time you’re out and about, should you find yourself interested in the health insurance coverage rate or the privately owned housing starts in Illinois, FRED publishes an app for your iDevice so you can get some of the best economic data sources “to go”.

Data Scientists are the Future

Posted by: Grant Stanley on December 12th, 2011

The Future Belongs to Data Scientists

Data scientists help people create knowledge from data, sometimes including millions of gigabytes of data. An example is iTunes using the number of songs and the length of each song on a CD to find the name of the CD, the artist, and the title of each song. To data scientists, tracks on a CD are not music, but data.
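
The idea can be sketched in a few lines of Python. This is only an illustration of treating a disc as data, not the algorithm that iTunes or its lookup service actually uses. The catalog entry, the track lengths, and the names disc_key and identify are all made up.

```python
def disc_key(track_lengths_seconds):
    """Describe a disc purely by its track count and track lengths."""
    return (len(track_lengths_seconds), tuple(track_lengths_seconds))


# Hypothetical catalog mapping disc keys to album metadata.
CATALOG = {
    disc_key([215, 187, 242]): {
        "album": "Example Album",
        "artist": "Example Artist",
        "titles": ["First Song", "Second Song", "Third Song"],
    },
}


def identify(track_lengths_seconds):
    """Return album metadata for a disc, or None if it is unknown."""
    return CATALOG.get(disc_key(track_lengths_seconds))


print(identify([215, 187, 242])["album"])  # Example Album
print(identify([200, 200]))                # None
```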

Until the turn of the century, someone’s knowledge was limited by their access to a library or university.  Now, because of the increasing power and storage capacity of computers, and the increase in data being published, someone’s knowledge is limited by their ability to process data.  For example, for $600 you can  buy a hard drive capable of storing all the music in the world.

Previously, the limited amount of available data made data science nearly impossible. Today, because almost everything we do involves a computer, people produce data as a by-product of their daily lives. For example, every month 30 billion pieces of content are shared on Facebook, with no signs of slowing. In 2009, the CEO of Hewlett Packard stated that “more data will be created in the next four years than in the history of the planet”. Contemporary Analysis believes that data science is valuable because it allows us to turn this data into a product that answers important questions and reduces waste.

Data scientists are helping consumers answer questions about where they should eat, what movies they should watch, and who they should date. They are helping governments answer questions about how to get people to switch to public transportation and how to prosecute graffiti artists more effectively. Data scientists are also helping businesses reduce waste by focusing their sales efforts, selecting the right market position, improving planning, and identifying the best employees.

The future of data science is limited by the number of people that are able to extract insight from billions of data points.  The McKinsey Global Institute estimates that by 2018, the United States will need between 140,000 and 190,000 more data scientists skilled in deep data analytics, and 1.5 million data-savvy managers.

To meet this need, we need to train people to have the computer science skills to organize data, the mathematical skills to extract meaning, and the writing and visualization skills to present the data. After all, according to Google’s Chief Economist Hal Varian, Data Science will be the “sexy” job of the coming decade.

Simple Science vs. Complex Science

Posted by: Grant Stanley on July 27th, 2011

Science is the systematic study of a phenomenon that includes observation and experimentation to explain and understand why things happen.  We can use science to explain almost everything in our universe from the effects of gravity to the impact on sales of your latest marketing campaign.  However, it is important to understand that there are two types of sciences, simple and complex, and that the answers they produce are different.

In simple sciences, such as physics and chemistry, the best possible answer is exact and often is not subject to change. For example, gravitational acceleration of objects in a vacuum is 32.2 ft/s² no matter the size, density, or shape of the object. In the complex sciences, such as economics, data science and biology, the best possible answer can never be exact and is almost always relative to time and situation. For example, under the current situation, if inflation increases and interest rates go up in the next 3 months, the S&P 500 might increase in value by 400 points with a 25% margin of error. The reason for these more complex answers is that complex sciences cannot study things in a vacuum.
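
One way to read the S&P 500 forecast above is as a range rather than a single number. The short Python sketch below assumes the 25% margin of error applies to the 400-point change itself, which is an interpretation and not something the example states. The function name forecast_range is hypothetical.

```python
def forecast_range(point_estimate, margin_of_error):
    """Return the (low, high) range implied by a relative margin of error."""
    spread = abs(point_estimate) * margin_of_error
    return point_estimate - spread, point_estimate + spread


low, high = forecast_range(400, 0.25)
print(low, high)  # 300.0 500.0
```

Under that reading, the answer is "somewhere between 300 and 500 points", while the simple-science answer for gravity is one fixed value.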

People often struggle with the scope of the problems that economists and data scientists answer, because they want exact answers with universal truths that hold across different situations and time periods. However, even though an exact answer is not possible, that does not mean the answers are without value or unscientific; rather, they are more complex. When thinking about problems, it is important to keep in mind the scope of the problem being solved. Specifically: can the problem be solved using simple sciences such as physics and chemistry, or does it call for complex sciences such as economics and biology? The difference between simple and complex science isn’t the level of importance or the difficulty of the problems, but the kind of answers they produce. (Why Predictive Analytics is Important)