Exploring Credit Default Swap (CDS) Market Data Using Modern Data Science Techniques

October 2nd, 2018

Title VII of the Dodd-Frank Wall Street Reform and Consumer Protection Act addresses the gap in U.S. financial regulation of OTC swaps by providing a comprehensive framework for regulating the OTC swaps markets. The objective of this blog post is to describe how to rapidly and securely analyze credit default swap (“CDS”) transaction data using cloud computing and advanced machine learning (“ML”) techniques. We obtained the CDS data from the Depository Trust and Clearing Corporation (“DTCC”). A fundamental technology enabler for our customers is the B23 Data Platform, a cloud-based artificial intelligence (“AI”) engine that discovers, transforms, and synthesizes data from a variety of sources to provide unique and predictive insights. The B23 Data Platform is used by data-centric enterprises in many industries, including technology companies, government agencies, and financial institutions, to securely use the Amazon Cloud to gain insight from very large data sets.

The scope and accomplishments of our efforts include:

- Created a secure ML analysis cluster in a representative customer private cloud
- Ingested 5 years of CDS data into the ML analytics cluster in 1 minute
- Identified anomalous CDS trading activities for complex market transactions
- Established CDS compliance reporting metrics
- Identified CDS market characteristics at the individual product and individual series level

Identify Anomalous Transaction Activity or Faulty Reporting in Markets

B23 investigated several transactions in the DTCC CDS data that looked peculiar. Figure-2 shows a relatively small number of transactions, composed of multiple line items in the data set, that exhibited anomalous market activities. In that example, yellow nodes are correction activities and red nodes are cancellation activities. Walking through each set of transactions, the following activities occur:

- A single contract is created for a value of $52M (blue node labelled “new”)
- The $52M original CDS is corrected...
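Figure-2 itself is not reproduced in this excerpt, but the underlying check is easy to sketch. The short pandas example below groups reported line items by contract and flags contracts with an unusually high number of corrections and cancellations; the file name, column names (trade_id, action, reported_at), and the quantile threshold are illustrative assumptions, not the actual DTCC schema or B23's production logic.

```python
import pandas as pd

# Illustrative schema: one row per reported CDS line item.
# Assumed columns: trade_id, action ("NEW", "CORRECT", "CANCEL"), reported_at.
trades = pd.read_csv("cds_line_items.csv", parse_dates=["reported_at"])

# Count each action type per contract.
actions = (
    trades.pivot_table(index="trade_id", columns="action",
                       values="reported_at", aggfunc="count", fill_value=0)
          .rename(columns=str.lower)
          .reindex(columns=["new", "correct", "cancel"], fill_value=0)
)

# Flag contracts whose correction/cancellation churn sits far above the norm.
churn = actions["correct"] + actions["cancel"]
anomalous = actions[churn > churn.quantile(0.999)]
print(anomalous.sort_values(["correct", "cancel"], ascending=False).head(20))
```

From there, the flagged contracts can be laid out as a chain of new, correction, and cancellation events per contract, which is essentially the view the figure visualizes.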

Four Reasons Why Data Engineering is a Zero-Sum Game for Most Organizations

September 17th, 2018

Data engineering is hard, and getting it exactly right results in a single outcome: machine learning engineers and quants can now do their jobs effectively. The analysis of data, and the subsequent execution on the resulting insights, is the competitive differentiator and core competency of a business: its heart and soul. Data engineering is the commoditized heavy lifting every organization needs to perform to get the analysis correct. This is why we see data engineering as a zero-sum game. Getting data engineering right means an organization is just breaking even; it simply allows other employees to do their jobs properly. Getting it wrong means everything and everyone else dependent on data engineering cannot operate effectively. Outsourcing this commoditized heavy lifting is the least risky and most cost-efficient path to the economic and market-leading competitive advantages organizations need to compete.

Prioritize Algorithm Development Over Data Engineering

Modern organizations should prioritize and invest in the algorithm development, quantitative research, and machine learning aspects of data science. These activities can make or break firms that use data for a competitive advantage. Applying machine learning in a meaningful way, using data formatted specifically for those algorithms, is not a trivial task. To be successful, organizations should recognize the undifferentiated and differentiated activities associated with extracting insight from data, and decouple the work required to get data into a specific format (or schema) to support those algorithms from the development and tuning of the algorithms themselves.

Race Car Drivers and Data Mechanics

An interesting social phenomenon we’ve observed over the past several years is that we have yet to meet a data engineer who wasn’t secretly plotting a career change to become a machine learning engineer and/or quant, with a more data-science-centric job title to boot. If machine learning engineers and quants...

When it Comes to Financial Data, the Power of Cloud can Help you See the Forest through the Trees

May 25th, 2018

At the core of portfolio construction is diversity, and at the core of diversity is correlation. One simple mantra has ruled finance for years: invest in a bunch of uncorrelated assets, and your portfolio will be less volatile. Consider an equities portfolio: how easy is it to spot direct or indirect relationships between companies, and therefore between stock prices? If the prices of two equities were correlated in the past, will that correlation continue into the future?

I want to explore this question, but to do that I’m going to need a metric ton of data and a few tools to help me sift through the garbage and find the gems. So, in the process, I’ll explore a couple of different questions, like: why is it more painful to run things in the cloud than to get kicked in the kneecaps? How can I simplify my complicated data ingest process so that different people can easily pick up and run with the large dataset I created? Just bear with me for a second, swallow a few of these vegetables I’m about to feed you, and I promise they’ll make you strong enough to tackle the deep, interesting questions of the universe.

Here’s where I plan to take you:

- Show you a hassle-free way to ingest a big, messy dataset to answer the question at hand
- Show you how to secure enough computing power to optimally handle such a dataset
- Explain stock price correlation and examine how stable correlations are over time
- Analyze the market overall from a correlation perspective
- Show you a sweet movie illustrating the evolution of market relationships over the last 30 years

Getting our hands on that sweet, sweet data

One thing that kicked me off on this project is that I happened to have a dataset lying around which I had scraped for another purpose. It’s only a few GB, and it’s living in the cloud in Amazon S3, but the format is a little messy. It’s split up across about 6,000 files, one for each company. Also, I don’t remember the exact folder structure, and I really don’t feel like poking around in my bucket before figuring out how I want to slurp it...
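The correlation piece of that roadmap is easy to sketch in pandas. Assuming each of those ~6,000 per-company files has a date column and a closing-price column (an assumption about the layout, since the excerpt doesn't show it), the snippet below builds a price matrix, converts it to daily returns, and computes rolling pairwise correlations so you can see how stable any single relationship is over time; the tickers AAPL and MSFT are just placeholders.

```python
from pathlib import Path

import pandas as pd

# Assumed layout: one CSV per company (prices/AAPL.csv, ...) with "date" and
# "close" columns. The same code can point at S3 paths if s3fs is installed.
series = {}
for path in Path("prices").glob("*.csv"):
    df = pd.read_csv(path, parse_dates=["date"]).set_index("date")
    series[path.stem] = df["close"]

prices = pd.DataFrame(series).sort_index()
returns = prices.pct_change().dropna(how="all")

# Pairwise correlations over a trailing 252-trading-day (roughly 1-year) window.
rolling_corr = returns.rolling(window=252).corr()

# Example: how stable has the AAPL/MSFT relationship been over time?
pair = rolling_corr.xs("AAPL", level=1)["MSFT"].dropna()
print(pair.describe())
```

At the full ~6,000-ticker scale, that rolling pairwise correlation matrix gets enormous, which is exactly why the rest of this post is about securing serious computing power in the cloud rather than grinding through it on a laptop.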

Announcing Jupyter Notebook on the B23 Data Platform

March 7th, 2018

B23 is happy to announce we’ve added Jupyter Notebook as the latest stack in our platform. Jupyter has quickly become a favorite with Data Scientists because of its notebook format and support for many programming languages like R, Scala, and Python. The B23 Data Platform gives Data Scientists access to their preferred data processing tools in a secure, automated environment targeted specifically to their business needs.

According to a recent Harvard Business Review survey, 80% of organizations believe the inability of teams to work together on common data slows the organization’s ability to quickly reach business objectives. The B23 Data Platform can help these organizations boost their data science team’s productivity with notebook collaboration and sharing tools like Jupyter. Thanks to easier data access and computing power paired with rich web user interfaces, open-source capabilities, and scalable cloud data-processing solutions, Jupyter Notebook adds another favored power tool to the B23 Data Platform.

With just a few button clicks, the Jupyter stack launches. Open the Jupyter Notebook URL and you are ready to start coding!

The B23 Data Platform is an open, secure, and fast marketplace for big data, data science, artificial intelligence, and machine learning tools. In minutes, Data Scientists can securely analyze their data in the cloud, with the freedom to use familiar tools like Apache Spark, Apache Zeppelin, RStudio, H2O, and Jupyter Notebook. Discover a better way to analyze your data with B23.

About the Author: Courtney Whalen is a Data Scientist at B23 LLC and is leading geospatial analysis efforts. She received a B.A. in Computer Science and a B.A. in Mathematics from Elon University. When she’s not busy mapping geographical points, Courtney can be found logging her fitness and distance in long runs, triathlon training, and CrossFit challenges. ...

Experimenting with Chromebook Data Science in the Cloud

September 26th, 2017

Last spring I gave a talk at New York R Conference and EARL SF titled “The Missing Manual for Running R on Amazon Cloud”. It was meant to be targeted at small (or large) enterprise users looking to build out or outfit a data science team capable of doing effective data science in the cloud, with all the data ingest, security and usability concerns and implications that come with navigating that space. In recent months I’ve been surprised but overjoyed to see cloud data science start to become championed by citizen data nerds and #rstats community folks in academia. Jeff Leek has been reporting on his second attempt at a Chromebook data science experiment. Brian Caffo has several videos on his YouTube channel about “going cloudy”, including a review of Jeff Leek’s blog post from above.

In an experiment of unknown and unplanned duration, I’ve been leaving my work laptop at work on Friday night, then resisting the urge to go get it on Saturday morning. If I need or want to do anything on the computer, be it R or otherwise, I have to figure out how to do it on my Acer Chromebook 11.

The major things, like working with RStudio Server on AWS, aren’t all that different from how I operate every day at work. I do find that I’m more likely to “cheat” and use a local-cloud hybrid approach to data management when I’m using my work machine, and I like that the Chromebook forces me to honestly evaluate the usability of the cloud data science system we’ve designed. It’s the little things that have me feeling constrained on the Chromebook. Taking screenshots, managing them all, editing diagrams and trying to create slide deck presentations is all a bit of a drag. So far I’ve felt more effective switching to my phone when I need to do that sort of thing. Making an Acer Chromebook 11 feel satisfying to operate is probably an entirely lost cause, but there is something really fun about having all the power of the cloud at your fingertips on one of the cheapest little laptops money can buy. ...

Exactly Once Data Processing with Amazon Kinesis and Spark Streaming

April 25th, 2017

The Kinesis Client Library provides convenient abstractions for interacting with Amazon Kinesis. Consumer checkpoints are automatically tracked in DynamoDB (Kinesis checkpointing), and it’s easy to spawn workers to consume data from each shard (the Kinesis term for a partition) in parallel. For those unfamiliar with checkpointing in streaming applications, it is the process of tracking which messages have been successfully read from the stream.

Spark Streaming implements a receiver using the Kinesis Client Library to read messages from Kinesis. Spark also provides a utility called checkpointing (Spark checkpointing; not to be confused with Kinesis checkpointing in DynamoDB) which helps make applications fault-tolerant. Using Spark checkpointing in combination with Kinesis checkpointing provides at-least-once semantics.

When we tried to implement the recommended solution using Spark checkpointing, it was very difficult to develop any code without breaking our checkpoints. When Spark saves checkpoints, it serializes the classes which define the transformations and then uses that to restart a stopped stream. If you then change the structure of one of the transformation classes, the checkpoints become invalid and cannot be used for recovery. (There are ways to make code changes without breaking your application’s checkpoints, but in my opinion they add unnecessary complexity and risk to the development process, as cited in this example.) This challenge, in combination with a sub-optimal at-least-once guarantee, led us to abandon Spark checkpointing and pursue a simpler, albeit somewhat hacky, alternative.

Every message sent to Kinesis is given a partition key. The partition key determines the shard to which the message is assigned. Once a message is assigned to a shard, it is given a sequence number. Sequence numbers within a shard are unique and increase over time. (If the producer is leveraging message aggregation, it is possible for multiple consumed messages to have the same sequence number.) When starting up a Spark...
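The excerpt cuts off before the implementation details, but the core of the alternative it describes, remembering the last sequence number you successfully processed for each shard and resuming just after it, can be illustrated outside of Spark with plain boto3. The stream name, shard id, and the load/save helpers below are placeholders for whatever durable store (DynamoDB, S3, a database) you pair with your processed output; this is a minimal sketch of the idea, not B23's actual Spark Streaming receiver code.

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

STREAM_NAME = "my-stream"            # placeholder stream name
SHARD_ID = "shardId-000000000000"    # placeholder shard id


def load_last_sequence_number(shard_id):
    """Read the last processed sequence number from a durable store (stubbed)."""
    return None


def save_last_sequence_number(shard_id, sequence_number):
    """Persist the sequence number alongside the processed output (stubbed)."""
    pass


def process(payload):
    """Placeholder for the application's actual message handling."""
    print(payload)


last_seq = load_last_sequence_number(SHARD_ID)
if last_seq is None:
    # No prior state: start from the oldest record still retained in the stream.
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM_NAME, ShardId=SHARD_ID,
        ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
else:
    # Resume immediately after the last record we know we processed.
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM_NAME, ShardId=SHARD_ID,
        ShardIteratorType="AFTER_SEQUENCE_NUMBER",
        StartingSequenceNumber=last_seq)["ShardIterator"]

resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
for record in resp["Records"]:
    process(json.loads(record["Data"]))
    last_seq = record["SequenceNumber"]

if last_seq is not None:
    save_last_sequence_number(SHARD_ID, last_seq)
```

The effectively exactly-once behavior comes from persisting the sequence number together with (or after) the durably written results, so that a restart resumes only after the last record whose effects are known to have been committed.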