I have built a RAG based LLM coach and I am currently experimenting with a new ISO27001 compliance agent. Two different products and both hit the same issues. What I have learned is, aside from the non-deterministic properties of Large Language Models, the real difficulty is the collection of data within your knowledge base, and that’s a much bigger task than anyone gives it credit for.
The setup looks deceptively simple
When you start and have a play around the quick success makes it look like you have solved a problem. The flow looks something like this. Do some model evaluation and find one you like, if you are using Bedrock then no doubt a Claude model will be your pick but nevertheless model selection based on your task is a key part.
Then you have your model, you wire up a vector store this can be on AWS OpenSearchService (costly), or S3 Vectors (which is around 90% cheaper) or something like Chroma which I was experimenting with local vector store. Then decide on your chunking logic, this could be one chunk per file, per section of a file etc very much depending on token context amounts and then use an embedding model to do the work of getting your files/data into the vector store.
You very easily end up with a working RAG based LLM, where you take a query / user prompt -> embed the query -> the vector store finds nearest chunks and returns them -> You take these chunks and prompt the LLM for an answer. Then you have an answer based on what was found in your knowledge base, especially if you are careful and restrictive with your prompt engineering.
Working and working well are different things
But working well on some predefined queries, against what’s most likely a thin collection of information. Everything seems fine but then you will quickly run into lots of edge cases and imperfect answers. What you learn when testing systems like this is really you can’t code your way out of it, unless you fall back to the models training data but then you are losing uniqueness, the richness and the authenticity of your solution vs what all the other foundation models offer within their own apps.
The experts don’t have the time, and that’s the real problem
I had this issue first hand at MEDDICC, it’s hard to get the time of the CEO (Andy and main MEDDPICC man) + Pim (the second MEDDPICC man) and really sit down with them to create the volume of data that makes a difference. You can get pretty far with what you can access, but to really handle a lot of user questions, especially because you don’t control inputs, it just takes an ungodly amount of time the experts just don’t have.
Even on this project I am building now, the best person for the job to create and curate the data is an ISO27001 consultant billing 2k/day with everything she has learned helping many clients successfully get through the audit. But that person is just not going to sit and spend the time creating the data.
You can lean on AI to help, but that only takes you so far and really strays from authenticity with the added risk of hallucinations.
A real example: the pentest question
Take my current investigation project, the ISO27001 coach. I queried the system on a real question I had “We have never done a pentest, do we need to do one to be compliant?”. It was able to fetch relevant files (chunks):
[dist 1.707] # A.8.8 Management of technical vulnerabilities
[dist 1.759] # A.6.3 Information security awareness, education and training
[dist 1.804] # A.7.2 Physical entry
All three scores were above 1.7 with no clear winner, when you see results like this it signals that there is no confident answer, just the best available guesses. There’s no universal scale, it’s something you calibrate against to ensure you have the right content being returned for the query in question. On the prompt side and what was returned it gave a decent enough answer.
Based on the control context provided, I cannot find a specific requirement for penetration testing to achieve ISO 27001 compliance.
The controls mentioned (A.8.8 on vulnerability management, A.6.3 on security awareness, and A.7.2 on physical entry) do not explicitly require penetration testing.
However, A.8.8 does require that "information about technical vulnerabilities of systems in use is obtained in a timely way" and that "the organisation's exposure is evaluated." Depending on your risk assessment and the sensitivity of your systems, a pentest could be one method to fulfill this requirement, but it is not the only way.
To give you a definitive answer on whether a pentest is necessary for your specific compliance, I would need to review:
- Your organization's risk assessment
- Your vulnerability management procedures
- The full set of A.8 controls (not just A.8.8) as they may address this more comprehensively
I recommend discussing this with your compliance team or auditor in the context of your overall risk profile.
I know the real answer, because I’ve had this conversation with auditors. You only have to do a Pentest for ISO27001 if you say you are doing one as part of your Vulnerability Management Programme and within your IT Manual. There’s plenty of vulnerability and security management you do need to do (I do recommend anyone with enterprise customer base just gets one done at least annually). But the Pentest itself is only mandatory because you committed to doing it.
That extra detail is not within the content, therefore with the prompt guardrails in place, the agent won’t answer with that. It just answers with what’s in the data.
If you are curious my basic system level prompt for the AI model was:
You are an ISO 27001 auditor. Answer using ONLY the control context
provided. If the answer isn't in the context, say so plainly rather than guessing.
The data is the product
The key thing I have learned is that the data you collect, the knowledge base itself, is the product. You should spend time creating it, curating it and refining it over time. Adding more and more data to it as you learn what people are asking. The moat is not the tech stack, the LLM model additions you add on top of the knowledge base, it’s the accumulated domain knowledge captured that no one has taken the time and effort to build.
Building a RAG based product arguably is easy, but building something truly useful is the hard work.