CPAT Concept is Proven
It works! ...well, sorta
I've been working on my CPAT project for a while now. Quite honestly, I've grown tired of looking at ideas instead of code. So, I felt I needed to push for the inflection point where it was either proven, or failed in spectacular over-ambition.
The last real progress I shared for this project was from 2018, and it mentioned the project was moving to React. While yes, I did finally make the move to React, I also completely redesigned the system.
Recap: What are we trying to accomplish?
In short, it's a real-time, collaborative, decentralized application for network pentesting engagements.
Way back in 2014, while working with a friend in college, we wanted to know if it was possible to build data aggregation for OSINT tools. This was fueled by seeing real-time data moving in Meteor.js. This project went on to get us through our senior capstone class, and even spark a research paper. Fun stuff.
Before any of those motions, it was the LAIR project from DEFCON 21 that got me thinking about the possibilities of mixing network penetration testing and collaborative, real-time web applications. LAIR was a project sponsored by FishNet Security (now they're Optiv?), and based on their slide notes it appears they were interested in a few of these hypothetical benefits too:
"Simplifies real-time synchronization of information across multiple, distributed clients"
- Reduces duplication of effort: Workflow, Status tracking
- Enhances information sharing: Credentials/hashes found, Manually identified vulnerabilities, Successful exploitation, False positives, Screenshots
- Team Instant Messaging
I liked LAIR being real-time and collaborative, but I wanted to see if I could introduce a new feature: data decentralization--ideally geographically decentralized.
At the time, I didn't realize how far above my head I was by shooting for all of that together. I still don't, but I didn't then either. ¯\_(ツ)_/¯
New Tech Stack
For a while I tried to recreate this project with Meteor.js, but I quickly ran into scalability concerns and complexity issues. Meteor scales relatively well for certain application designs, but I was seeking something truly decentralized. Meteor's app server design runs pretty counter to that.
So, I needed to abandon Meteor.js and choose something else--leading to micro-services. Using a service-oriented architecture did lay the groundwork for scaling later on, but it also gave me a means to upgrade any "piece" of the project if technologies were moving faster than I could develop. So, I decided on a new stack:
- Web application
- Core API
- OSINT tooling API
- Data streaming
...and it's orchestrated with Docker for the time being.
Moving Data through the Pipeline
Taking a step back, this "real-time, collaborative, decentralized application" for pentesting is a streaming ETL pipeline at its core. Since I've never built one of these before, I needed to prove the core features could actually work--at least in theory--before moving on to the more sensible software design chores. Especially that data decentralization one.
- Real-time: Websockets connection between React and .NET Core via SignalR
- Collaborative: data from one user's actions automatically propagated to another user's available data (this is complicated)
- Decentralized: no single "node" of the system is a complete point of failure (this is more complicated)
- Full-text search
- Manageable on consumer hardware
This diagram is a rough idea of the architecture.
Tooling: React.js front-end
It connects to cpat-core via Websockets and basic REST calls. It also sends REST calls to osint-api when a script kickoff is requested. It currently uses Redux for the REST data, but I have a kludgy implementation for the Websockets communication. At this point it's good just to have things working. Relatively speaking, once I figured out how to make a Websocket connection on-demand and on page load...this was the easy part.
Tooling: C#, .NET Core API, NPoco
This is the main "business logic" for CPAT. Even though Kafka is enabling the stream processing, which is much heavier lifting than cpat-core, this central API is commanding the data. cpat-core sets the tone for how data should be treated across the whole system.
Any new data entered manually from cpat-client is funnelled through here to MongoDB. cpat-client relies on cpat-core for Websocket connections, too. The idea is that with Websockets the networking foundations exist for live data to be piped back to the web client, regardless of whether it comes from another user manually entering/uploading data or from some automated script in osint-api uploading results.
cpat-client has a connection to osint-api, and then onto cpat-core. The Python osint-api sends job metadata to cpat-core when cpat-client requests a script be kicked off. So cpat-core is storing and retrieving job analytics as well.
On the backend, cpat-core also has ORM logic to communicate with MongoDB, provided via NPoco. This is pretty standard fare; I just wanted to mention it as it's another concern funneling data to the data storage.
Tooling: Python, Flask, various OSINT tools
Any script or tool set to be automated via the cpat-client is "hosted" or configured here. I'm using the Flask framework to build the REST API quickly, and this should be considered highly experimental. There is a whole slew of security concerns in opening up automation tools that run within a CLI. Plenty of logistical issues too. Each tool (for now only nmap) is kicked off in a new process with Popen. This means ferrying data is primarily done by managing STDERR. Not to mention, OSINT tools are scanning across the public internet, and any files dumped from such a tool would also need to be managed.
We can just call this a minefield.
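For a feel of what "kicked off in a new process with Popen" looks like, here is a rough sketch. It is not the actual osint-api code: `run_tool`, `JobResult`, and the timeout value are hypothetical, and the demo runs a harmless Python one-liner instead of nmap so it works anywhere. It assumes both output streams are captured as text and that a runaway tool is simply killed.

```python
from __future__ import annotations

import subprocess
import sys
from dataclasses import dataclass


@dataclass
class JobResult:
    """What a finished tool run hands back to the API layer (hypothetical)."""
    command: list[str]
    returncode: int
    stdout: str
    stderr: str


def run_tool(command: list[str], timeout_s: float = 300.0) -> JobResult:
    """Run one CLI tool in its own process and collect both output streams."""
    # An argument list (no shell=True) keeps user input away from a shell.
    proc = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        out, err = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A long-running scan that overstays its welcome gets killed.
        proc.kill()
        out, err = proc.communicate()
    return JobResult(command, proc.returncode, out, err)


if __name__ == "__main__":
    # Stand-in for a real tool: writes one line to each stream.
    fake_scan = [
        sys.executable,
        "-c",
        "import sys; print('scan done'); print('warn', file=sys.stderr)",
    ]
    result = run_tool(fake_scan, timeout_s=30)
    print(result.returncode, result.stdout.strip(), result.stderr.strip())
```

Swapping the stand-in for something like `["nmap", "-oX", "-", target]` would be the obvious next step, and that is exactly where the minefield starts: validating `target`, limiting concurrency, and cleaning up whatever files the tool leaves behind.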
Providing OSINT tools behind a REST API also opens a big question for extensibility. While the theoretical capability of kicking off a few dozen long-running scan tools, asynchronously uploading their results, and watching CPAT re-analyze as the dataset grows larger is enticing...it would be difficult to dynamically add more tooling behind the API. Especially since the tools are tightly coupled to their CLI implementation, require unique automation parameters, and are sending back schema-less data.
Needless to say this is a piece of CPAT I wanted to integrate for testing and "concept" purposes, but for now I'm not keen on deploying this anywhere. This type of tooling can be integrated in other ways.
MongoDB and a note on CockroachDB
MongoDB is currently in the architecture diagrams--it's all pretty standard here. Except...
Long story short, CPAT's current design heavily relies (relied?) on CockroachDB to provide native data decentralization. CockroachDB is marketed as a "resilient geo-distributed SQL database". Impressive. And to be direct, an appropriate price follows those features. Yes, the core product is freely available and open-source; however, the change data capture (CDC) feature is limited to paid offerings. This isn't to paint CRDB or CockroachLabs in a poor light--their product is worth every penny. Unfortunately though, until I can provide enough funding, CPAT will have to get by with whatever solution can be built with other database systems.
(See CREATE CHANGEFEED: https://www.cockroachlabs.com/docs/stable/change-data-capture.html#configure-a-changefeed-enterprise)
After that discovery, the switch was made to MongoDB. It's not a perfect replacement (honestly, not sure if any other DB offers what CRDB does), but it will suffice while CPAT is in an alpha/proof-of-concept stage.
The keen-eyed will notice this was a switch from a SQL-compliant RDBMS to a NoSQL, document-based database. Coming from Meteor.js initially, MongoDB was actually my first choice just to keep things simple and continue using what I had before. Considering the schema-less nature of the unfiltered OSINT data I would be capturing, all of this seemed to be working naturally anyway. So, I actually had to design around CRDB's relational structure; moving back to MongoDB removed some of this.
The specifics of the schema changes and some of CRDB's other intricacies are outside the scope of this post, but that is all to say real-time data and decentralization are completely possible, just expensive.
Kafka and ElasticSearch
These pieces have given me the most grief. Kafka, ksql-db, and the possibilities for manipulating data in a streaming pipeline are completely new to me. Further, while I've worked with ElasticSearch before, I didn't realize until putting the final pieces together that I hadn't stopped to consider some real data denormalization concerns for what I was planning on indexing into Elastic.
For now, it works enough for me to call it a success. Much of Kafka and ksql remain a black box to me, but after somehow setting up a rudimentary implementation of it all...I was able to push data from cpat-core to Mongo to Kafka and onto Elastic. This meant:
- Received real-time data flow for ingesting into Mongo and showing up on separate
- Automatic pushing to Elastic for search capabilities
A note on analytics.
The next step here is to integrate more of the ELK stack for analytics. Another win for setting up the backend in Docker containers this early in the project was to enable quick access to Kibana and the Elastic Observability products. It's possible to experiment with Elastic SIEM from here too--unsure if that fits the intention of CPAT though.
So, in theory, the "full" architecture should resemble something like the following when deployed. Consider this diagram a very "hand-wavy idea" though. I hope it's an accurate representation of how data flows through clustered instances of MongoDB, Kafka, and ElasticSearch, but for now I'll admit there's much for me to learn on those systems.
A note on early scalability
Normally, I wouldn't front-load a project with so much effort for scalability. However, that aspect was central to the original idea that inspired me to try this at all. Without horizontal scaling and some semblance of decentralization, the project could've stayed on Meteor.js or just reverted to being a CRUD/ETL workflow focused on network pentesting. All of which is fine, just not as fun.
Currently, the UI is basically nonexistent. However, it is possible to open the application in two separate browser instances (i.e. one "normal", one incognito mode) and create data at the same time. This shows a primitive level of "collaboration", if we're stretching the word.
- Support for separate users (i.e. separate data to separate users)
- Associate data created to specific users
- JWT support and some basic security kind of follows naturally from allowing user accounts.
The first major design issue. It almost killed the thing, and quite frankly still might.
- An understanding of a Kafka stack (Apache or Confluent based)
- Solve the failing ElasticSearch sink connectors
- Figure out how to provide data transformations, aggregations via
- *Funding for CockroachDB
This is theoretically proven through seeing how others have operated CockroachDB and MongoDB at varying levels of decentralization. See the note on funding for CRDB below for why I opted for MongoDB over CRDB. This feature is key to demonstrating whether it's possible to have a truly cloud-agnostic deployment one day. Unfortunately, it also stands as the most difficult to develop/test independently--so, we must mostly rely on the stories from others' adventures.
- Engineering effort/time into testing a 3, 5, and 7-node MongoDB deployment.
- Determine how to operate a globally distributed rollout of MongoDB. If it can't be done, document it.
- *Funding for CockroachDB
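Why 3, 5, and 7 nodes specifically? A MongoDB replica set needs a majority of its voting members to elect a primary, so odd member counts get the most fault tolerance per node (and a replica set allows at most 7 voting members). The arithmetic is small enough to sketch; the function names below are my own, not anything from MongoDB.

```python
def majority(voting_members: int) -> int:
    """Smallest number of votes that is more than half of the members."""
    return voting_members // 2 + 1


def tolerated_failures(voting_members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return voting_members - majority(voting_members)


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(f"{n} nodes: majority={majority(n)}, "
              f"can lose={tolerated_failures(n)}")
```

Note that a 4-node set tolerates no more failures than a 3-node set, which is why the even sizes aren't on the test list.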
To even attempt this project in a monolithic fashion is a non-starter. The desires of the project are micro-service oriented just by nature...and it works.
- A sane way to enable HTTPS across all services.
- Revisit the various data models used as data is communicated across each service.
This is probably the most important feature of the entire application. It's where the reporting data/queries will come from, and it's an entry point into natural language processing (NLP) and other AI/ML enhancements. A lot of stuff is left to uncover here.
- Data denormalization
- Fixing the ElasticSearch sink connectors bringing in data from Kafka
- An indexing strategy: one global index, and consider other indexes for specific data types
- Testing with a 3, 5, 7-node setup
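As a sketch of what the denormalization and indexing items could mean in practice: a nested host document (as it might sit in MongoDB) gets flattened into one search document per service, and each flat document is routed to both a global index and a type-specific one. The field names, the `cpat-all` / `cpat-service` index names, and the document shape are all assumptions for illustration; CPAT's real data models aren't settled.

```python
from __future__ import annotations

from typing import Any, Iterator

GLOBAL_INDEX = "cpat-all"  # hypothetical name for the one global index


def denormalize_host(host: dict[str, Any]) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (index_name, flat_document) pairs for one nested host record."""
    # Parent fields are copied onto every child so each hit stands alone.
    base = {
        "engagement": host["engagement"],
        "ip": host["ip"],
        "hostname": host.get("hostname"),
    }
    for svc in host.get("services", []):
        doc = {
            **base,
            "type": "service",
            "port": svc["port"],
            "protocol": svc.get("protocol", "tcp"),
            "banner": svc.get("banner", ""),
        }
        yield GLOBAL_INDEX, doc           # searchable across everything
        yield f"cpat-{doc['type']}", doc  # searchable per data type


if __name__ == "__main__":
    sample = {
        "engagement": "demo-engagement",
        "ip": "10.0.0.5",
        "hostname": "web01",
        "services": [
            {"port": 22, "banner": "OpenSSH"},
            {"port": 80, "protocol": "tcp", "banner": "nginx"},
        ],
    }
    for index_name, doc in denormalize_host(sample):
        print(index_name, doc)
```

The trade-off is the usual one: duplicated parent fields make search simple, but every change to a host means re-indexing all of its children.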
Manageable on consumer hardware
The most important statement I wanted to make with this project was for what's possible on consumer hardware now, with a little bit of engineering. Development started on a 2011 Mac mini (when the project was on Meteor.js), and is now on a more modern, powerful machine but still very "consumer". Currently, it's being developed with a Ryzen 7 3800X, 32GB of RAM, and "enough" SSD storage space. It's not cheap or "entry-level", but still approachable.
- Nothing really.
- Pin down minimum hardware requirements? (Probably quad-core CPU, 8-16GB RAM)
*Funding for CockroachDB--CRDB is prohibitively expensive for where this project currently stands. At ~$1800/CPU core/yr for on-premise, or ~$85/mo for a SaaS-based cloud offering...my personal funding simply doesn't stretch that far right now. To add to this, the only feature needed is CDC; the rest of the open-source version of the database works just fine. So, shy of production needs, the project will be sticking with some other type of database technology (i.e. Mongo + Elastic).
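To put rough numbers on that, here is the back-of-the-envelope arithmetic using only the approximate prices quoted above. The node and core counts are hypothetical, and how cores are actually counted for licensing is an assumption on my part.

```python
ON_PREM_PER_CORE_PER_YEAR = 1800  # approximate figure quoted in this post
SAAS_PER_MONTH = 85               # approximate figure quoted in this post


def on_prem_annual(nodes: int, cores_per_node: int) -> int:
    """Yearly on-premise cost if every core on every node is licensed."""
    return nodes * cores_per_node * ON_PREM_PER_CORE_PER_YEAR


def saas_annual() -> int:
    """Yearly cost of the SaaS offering at the quoted monthly price."""
    return SAAS_PER_MONTH * 12


if __name__ == "__main__":
    # Hypothetical deployment: 3 nodes with 4 cores each.
    print("on-prem:", on_prem_annual(nodes=3, cores_per_node=4))
    print("saas:   ", saas_annual())
```

Even the smallest multi-node on-premise setup lands in five figures per year under these assumptions, which is a lot to pay for a single feature on a hobby project.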