The tech department your business doesn't have — and the AI most agencies can't build
Web, hosting, infrastructure, and custom AI integration for Irish SMEs. We do the work an in-house tech team would do, without the headcount. We also have a rare capability most agencies can't match: training and hosting your own AI models on your own data.
Anything tech your business struggles with
Most Irish SMEs are too small for an in-house tech team but too dependent on tech to ignore the problems. We're the outsourced version — modular if you need one thing, full-stack if you need everything. Below is what we do today; we expand into adjacent areas as we grow.
Web Creation
Marketing sites, brochure sites, simple CMS, contact + analytics — built fast, designed to convert. Modern stack (Next.js / static), mobile-friendly by default, your content under your control.
Hosting & Maintenance
We host and maintain your site so you don't have to. Backups, SSL, security patching, content updates, uptime monitoring, monthly reporting. Predictable monthly fee.
Server & Cloud Infrastructure
Server provisioning, cloud migration, email + DNS + SSL automation, security hardening, monitoring, on-call patching. The plumbing your business runs on, kept boring on purpose.
Custom Software
Bespoke internal tools, dashboards, integrations between systems that don't talk to each other, automation of work that shouldn't be manual. If a person types it every day, we can probably eliminate it.
AI & Data
Custom AI integration where it actually saves time — document processing, internal Q&A on your own knowledge base, workflow automation. Hosted on your infra, on your data — not on someone else's API. Proof we can build this →
Outsourced Tech Department
All of the above, retained relationship, one point of contact for any tech need. You stop juggling vendors. We handle whatever lands. Monthly retainer, scoped to your business.
A ladder, not a contract trap
Every engagement starts the same way: a free 30-minute call. From there, you decide how far you want to go. No long contracts to sign just to get a conversation.
Free discovery call
30 minutes, no charge, no commitment. You tell us what's broken or what you're trying to build. We tell you whether we can help, and what it might look like.
Tech & AI Consultation
A 90-minute deep-dive. We audit what you have, identify the highest-leverage opportunities, and leave you with a written 30 / 60 / 90-day roadmap — yours to keep, even if you don't engage further.
Project
A scoped engagement to deliver one or two of the things from the roadmap. Fixed-fee where possible; clear milestones; weekly progress updates. The consultation cost comes off the project.
Outsourced Tech Dept
Once we've delivered, the natural step is a retainer. We become the team you call for anything tech. Predictable monthly fee, single point of contact, scoped to your business.
When we say "AI-capable," we mean it
Most agencies sell "AI integration" by reselling someone else's API. We can do that — but we can also build, train, and host the model itself. Below: real measured gains from a 35-billion-parameter language model we trained ourselves. Full technical detail is one click away if you want it.
600 tasks. +21.2 pp lift. Statistically dominant.
Same base model and pipeline as Phase 0, but trained on the full 40,100-record S+ curated corpus for two epochs. This is a capability-gain campaign: a full A/B head-to-head against the unmodified base model on 600 tasks per model across six suites. Wilson 95% confidence intervals do not overlap.
Base vs Nexis Phase 0.5 — head-to-head on 600 held-out tasks
Both models served from the Modal volume at the same quantization (GGUF Q4_K_M, llama.cpp CUDA, all layers on H200) with identical sampling config. Every task programmatically verified — code execution, numeric tolerance, regex, format rules. These are the raw numbers. The base model runs without a system prompt; the tuned model runs with the Nexis system prompt that matches production serving.
| Suite | What it measures | Base | Nexis P0.5 | Δ |
|---|---|---|---|---|
| Objective suite (hand-crafted, 20 tasks) | Coding, debugging, math, reasoning, adversarial disambiguation, instruction-following. Already saturated on base at 85%, so there is no headroom for lift. | 17 / 20 (85.0%) | 17 / 20 (85.0%) | 0 pp |
| HumanEval-full (155 tasks, HumanEval suite) | Python function synthesis from docstring; code must pass all hidden test cases. Mainstream code-generation benchmark. | 42 / 155 (27.1%) | 74 / 155 (47.7%) | +20.6 pp |
| GSM8K-full (200 tasks, GSM8K suite) | Grade-school math word problems; multi-step arithmetic reasoning with numeric-tolerance verification. Mainstream math benchmark. | 90 / 200 (45.0%) | 180 / 200 (90.0%) | +45.0 pp |
| IFEval-lite (15 tasks, IFEval-style) | Strict instruction-compliance tasks with programmatic verifiers (word count, format, JSON, forbidden characters). | 5 / 15 (33.3%) | 6 / 15 (40.0%) | +6.7 pp |
| MBPP-full (200 tasks, MBPP suite) | Mostly-basic Python problems; code synthesis from natural-language spec with unit-test verification. | 79 / 200 (39.5%) | 77 / 200 (38.5%) | −1.0 pp |
| Identity (10 probes, self-identification) | "Who are you?" style probes that verify the Nexis identity stuck under the system prompt. Expected to move, and it did. | 1 / 10 (10.0%) | 7 / 10 (70.0%) | +60.0 pp |
| Combined | All 600 held-out tasks, one pass per task, direct-answer mode. 95% Wilson CIs do not overlap. | 234 / 600 (39.0%) | 361 / 600 (60.2%) | +21.2 pp |
The headline +21.2 pp isn't concentrated in one suite — GSM8K moves +45, HumanEval moves +20.6, Identity moves +60, IFEval moves +6.7. MBPP drifts −1 (within noise). Objective is already saturated at 85%. The lift pattern matches what one would expect from 40,100 well-curated examples targeting code, math, and reasoning — not a single-task artifact.
Sample outputs
Three representative qualitative responses from the Phase 0.5 evaluation run, captured in full-thinking mode. The outputs are unedited.
Code review: bug identification

    def binary_search(arr, target):
        lo, hi = 0, len(arr)
        while lo < hi:
            mid = (lo + hi) // 2
            if arr[mid] == target:
                return mid
            elif arr[mid] < target:
                lo = mid
            else:
                hi = mid
        return -1

The most important bug is in the update of lo and hi. When arr[mid] < target, lo should be updated to mid + 1, not mid. Similarly, when arr[mid] > target, hi should be updated to mid - 1, not mid. The current implementation can lead to an infinite loop if the target is not found in the array.
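For reference, one way to repair the snippet above is sketched here. It is an editorial illustration, not part of the model's recorded output. It assumes the half-open search range implied by `hi = len(arr)`, so only the `lo` update needs to change; a closed-interval variant would instead start `hi` at `len(arr) - 1` and move it to `mid - 1`.

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if it is absent."""
    lo, hi = 0, len(arr)          # search the half-open range [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1          # skip mid so the range always shrinks
        else:
            hi = mid              # hi is exclusive, so mid is already ruled out
    return -1
```

With `lo = mid + 1` the range shrinks on every iteration, which removes the non-terminating case when `hi - lo == 1`.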
Multi-step reasoning: math / break-even
To break even, the company needs to cover its fixed costs and variable costs with its revenue. Let's denote the number of widgets sold as x.
Total cost: TC = FC + VC = $10,000 + ($3 × x)
Revenue: R = P × x = $8 × x
At break-even, R = TC:
$8x = $10,000 + $3x
$5x = $10,000
x = 2,000
Final answer: 2,000 widgets per month.
Adversarial disambiguation: root-cause ranking
The most likely explanations in order:
- Python environment changed. Different virtual environment, different conda env, different Python version, or IDE swapped interpreter. Diagnose with `python --version`, `which python`, and `echo $VIRTUAL_ENV`.
- numpy not installed in the current environment. Check with `pip show numpy` or `python -m pip show numpy`. Install with `pip install numpy` if missing.
- Python path issues. `sys.path` may have changed. Print it with `python -c "import sys; print(sys.path)"`.
- Corrupted installation. Reinstall: `pip uninstall numpy && pip install numpy`.
- Shadowed module. A local file named `numpy.py` can shadow the import.
Quick diagnostic sequence: python --version → pip show numpy → sys.path → reinstall if needed.
Method
Training
BF16 LoRA (rank 32, alpha 32, dropout 0) for two effective epochs on 40,100 curated records. Per-device batch 4 × gradient accumulation 8 = effective batch 32. Cosine LR schedule peaked at 1e-4 with 3% warmup, paged AdamW 8-bit optimizer. Packing at 4,096 tokens. Warm-started from the Phase 0.5 smoke-run adapter; total 230 steps on Modal H200 141 GB.
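The learning-rate schedule described above can be sketched in a few lines. The peak, warmup fraction, step count, and batch arithmetic come from the text; the linear warmup shape, the decay to zero, and the helper name `lr_at` are assumptions made for illustration, since the source does not spell them out.

```python
import math

PEAK_LR = 1e-4            # from the text
TOTAL_STEPS = 230         # from the text
WARMUP_FRAC = 0.03        # from the text
EFFECTIVE_BATCH = 4 * 8   # per-device batch x gradient accumulation steps

def lr_at(step, total=TOTAL_STEPS, peak=PEAK_LR, warmup_frac=WARMUP_FRAC):
    """Learning rate at a 0-indexed optimizer step (assumed schedule shape)."""
    warmup_steps = max(1, round(total * warmup_frac))
    if step < warmup_steps:
        # Assumed: linear ramp that reaches the peak on the last warmup step.
        return peak * (step + 1) / warmup_steps
    # Cosine decay from the peak towards zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total - warmup_steps)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the warmup covers the first 7 of 230 steps, and the rate is close to zero by the final step.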
Dataset
40,100 records from eleven public, human-written, commercially-licensed sources: OASST2, CommitPackFT, TACO, CodeContests, GSM8K, MATH, NaturalProofs, OpenBookQA, Dolly 15k, Stack Exchange, NuminaMath-1.5. Filtered through an eight-stage pipeline with Aho-Corasick contamination checks against every held-out benchmark test set. Zero synthetic or AI-generated data.
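To illustrate the kind of check Aho-Corasick enables, here is a minimal pure-Python sketch: it builds one automaton from a set of benchmark strings and then scans each training record in a single pass. The function names, the lowercase normalization, and the sample strings are hypothetical; the production pipeline's actual matching rules are not described in the source.

```python
from collections import deque

def build_automaton(patterns):
    """Build Aho-Corasick tables: trie transitions, failure links, outputs."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    queue = deque(goto[0].values())       # depth-1 nodes fail back to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] |= out[fail[child]]   # inherit matches ending at the suffix
            queue.append(child)
    return goto, fail, out

def find_matches(text, automaton):
    """Return the set of patterns that occur anywhere in text."""
    goto, fail, out = automaton
    node, found = 0, set()
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        found |= out[node]
    return found

# Hypothetical benchmark snippet and training records, for illustration only.
benchmark_snippets = ["a train leaves dublin at 9am travelling at 80 km/h"]
automaton = build_automaton([s.lower() for s in benchmark_snippets])
records = [
    "Write a function that reverses a string.",
    "A train leaves Dublin at 9am travelling at 80 km/h. When does it arrive?",
]
clean = [r for r in records if not find_matches(r.lower(), automaton)]
```

The appeal of this approach is that scan time grows with the length of the record rather than with the number of benchmark strings, so every test set can be checked in one pass.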
Evaluation
600 programmatically-verified tasks per phase across six suites (objective, humaneval_full, gsm8k_full, ifeval, mbpp_full, identity) plus five qualitative probes. Base and tuned run under identical sampling config (temp 0.2, top-p 0.9). Wilson score 95% intervals computed for statistical significance; non-overlapping intervals are a positive lift signal at n=600.
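The reported intervals can be reproduced from the pass counts with the standard Wilson score formula. This is a minimal sketch, assuming z = 1.96 for a 95% interval; the function name is illustrative and the source does not show its own implementation.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

base = wilson_interval(234, 600)    # combined base pass count
tuned = wilson_interval(361, 600)   # combined tuned pass count

print([round(100 * x, 1) for x in base])    # [35.2, 43.0]
print([round(100 * x, 1) for x in tuned])   # [56.2, 64.0]
```

The upper bound for the base model sits well below the lower bound for the tuned model, which is the non-overlap the text refers to.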
Reproducibility
The tuned adapter ships to HuggingFace (tlxplays/nexis-35b-a3b-checkpoints). Task definitions, runner script, verifiers, and raw result JSONs are all in the repo. The entire Modal pipeline — training, merge, quantize, bench, A/B report generation — is scripted end-to-end and re-runnable against any future checkpoint.
What Phase 0.5 actually proves
- The training signal moves held-out capability. Unlike Phase 0 (525 examples, ~parity on OOD benchmarks), Phase 0.5 moves GSM8K by +45 pp, HumanEval by +20.6 pp, and the combined pass rate by +21.2 pp — all on held-out benchmarks the training data does not contain. That is a capability lift, not a distribution-fit artifact.
- Wilson 95% CIs do not overlap. [35.2, 43.0] for base vs [56.2, 64.0] for tuned, on n=600. The result is statistically distinguishable at p<0.05 — not a chance observation.
- The S+ corpus curation protocol is validated. 40,100 records, adversarial-hardened across three review rounds, zero synthetic data, Aho-Corasick decontamination against eight benchmark test sets. No leakage evidence: if the model had seen test answers, the delta would come from lexical overlap — but the structure of the gains (math and code generation, not exact-match keywords) is consistent with genuine capability improvement.
- MBPP −1 pp is within noise. On a 200-task benchmark, a 1-percentage-point swing is inside the natural batch-variance envelope for this class of model. No regression worth reporting.
- Quantization gap was controlled. Both base and tuned were benched at GGUF Q4_K_M after identical quantize steps. The comparison is apples-to-apples at the deployment precision.
What comes after Phase 0.5
A small training run, documented transparently
This was not a capability-gain campaign. 525 curated examples, 1 epoch of LoRA fine-tuning — deliberately small. The goal was to validate the end-to-end pipeline: data curation → training → merge → quantize → serve → benchmark. What follows is the full record of that validation, including a head-to-head against the unmodified base model.
Base vs Nexis — head-to-head on 60 held-out tasks
Both models were served locally at the same quantization (GGUF Q4_K_M, Vulkan, 12-layer GPU offload), with identical sampling config and prompt format. Every task is verified programmatically — code execution, numeric tolerance, or rule-based format checks. These are the raw numbers.
| Suite | What it measures | Base | Nexis R1 | Δ |
|---|---|---|---|---|
| Objective suite (hand-crafted, 20 tasks) | Coding, debugging, math, reasoning, adversarial disambiguation, instruction-following. Tests the categories the training data targeted. | 18 / 20 (90%) | 17 / 20 (85%) | −5 pp |
| HumanEval-lite (10 tasks from the HumanEval suite) | Python function synthesis from docstring. Model must generate code that passes all hidden test cases. Mainstream code-generation benchmark. | 10 / 10 (100%) | 10 / 10 (100%) | 0 pp |
| GSM8K-lite (15 tasks from GSM8K) | Grade-school math word problems. Multi-step arithmetic reasoning. Mainstream math benchmark. | 4 / 15 (27%) | 4 / 15 (27%) | 0 pp |
| IFEval-lite (15 tasks, IFEval-style) | Strict instruction-compliance tasks with programmatic verifiers (word count, format, JSON structure, forbidden characters). The benchmark most aligned with the training data. | 11 / 15 (73%) | 11 / 15 (73%) | 0 pp |
| Combined | All 60 held-out tasks, one pass per task, direct-answer mode. | 43 / 60 (71.7%) | 42 / 60 (70.0%) | −1.7 pp |
The −1.7 pp combined delta comes from a single task: a word-count function where the tuned model produced a semantically correct Python dict in a different insertion order than the test expected. Set that edge case aside and the two models are indistinguishable on the remaining 59 tasks. For Phase 0 at this scale, that is exactly the expected outcome: in-distribution fit moves, out-of-distribution capability does not.
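The failure mode described above is easy to reproduce. In Python, two dicts with the same key-value pairs compare equal regardless of insertion order, but their serialized text differs, so a verifier that compares strings rather than parsed values can reject a correct answer. The function and expected value below are hypothetical stand-ins, not the actual benchmark task.

```python
import json

def word_count(text):
    """Count occurrences of each whitespace-separated word."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

result = word_count("b a b")      # keys inserted in the order: b, a
expected = {"a": 1, "b": 2}       # same content, different insertion order

# Dict equality ignores insertion order.
value_match = result == expected
# Comparing serialized text does not.
text_match = json.dumps(result) == json.dumps(expected)
# Sorting keys before serializing restores an order-independent comparison.
canonical_match = (
    json.dumps(result, sort_keys=True) == json.dumps(expected, sort_keys=True)
)
```

Here `value_match` and `canonical_match` are true while `text_match` is false, which is the kind of gap that can cost a single task in a small suite.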
Sample outputs
Three representative responses from the full-reasoning evaluation run. These are unedited outputs.
Multi-step reasoning: math / break-even
To find the break-even volume, I'll use the break-even formula where total revenue equals total costs.
Given:
Fixed costs (FC) = $10,000 per month
Variable cost per widget (VC) = $3
Selling price per widget (P) = $8
Break-even formula: At break-even, Total Revenue = Total Costs, so P × Q = FC + (VC × Q).
Solving:
8Q = 10,000 + 3Q
5Q = 10,000
Q = 2,000
Verification: Revenue at 2,000 widgets: 2,000 × $8 = $16,000. Total costs: $10,000 + (2,000 × $3) = $16,000. Revenue equals costs. ✓
Final answer: 2,000 widgets per month.
Architecture review: pragmatic judgment
Recommendation: Start with a single Postgres database.
Why:
- Current load is trivial. 10 writes/sec and 200 reads/sec is well within Postgres's capabilities — Postgres routinely handles 10k+ writes/sec on modest hardware.
- CQRS adds significant complexity. Two data stores, eventual consistency handling, synchronization logic, and operational overhead of two databases.
- Most SaaS apps don't need CQRS. The pattern is most valuable when read and write workloads have fundamentally different characteristics. A 20:1 read-to-write ratio isn't extreme.
- Complexity cost. CQRS introduces eventual-consistency bugs, synchronization failures, harder debugging. For a typical SaaS app, that complexity often outweighs the performance benefits.
When to consider CQRS: read/write workloads have different scaling requirements; you need read models optimized for specific query patterns; high-throughput systems (10k+ writes/sec); separating domain logic from reporting/analytics.
Bottom line: Optimize for simplicity first. Postgres will handle your current load comfortably. Add CQRS only when you have a specific, measurable bottleneck that justifies the complexity.
Calibrated uncertainty: risk-aware response
rm -rf /tmp/* on a Linux server as root? Answer with your confidence level and reasoning.
Confidence Level: 95%
Safe aspects:
- `/tmp` is specifically designed for temporary files.
- Files in `/tmp` should be owned by the user who created them.
- Most applications create temp files in `/tmp` with predictable naming.
- The directory itself should not be deleted, only its contents.
Risks to consider:
- Running applications. If an application has a file open in `/tmp`, the delete will succeed but the file handle remains valid. The application may crash or misbehave.
- Special directories. Some systems use `/tmp` for database temporary files, build artifacts, application-specific runtime data.
- Symlinks. If `/tmp` contains symlinks pointing outside, `rm -rf` will follow them.
- Race conditions. Between listing and deleting, new files may be created.
Safer alternatives: find /tmp -mindepth 1 -delete or stage into a mktemp -d directory first.
Conclusion: Generally safe on a server where you control the workload, but verify no critical services are actively using /tmp first. Always test on non-production first.
Method
Training
BF16 LoRA (rank 16, alpha 32) for one epoch on 525 curated records. Full-precision LoRA instead of 4-bit QLoRA — the H200's 141 GB VRAM provided enough headroom to skip dequantization entirely, cutting per-step latency by roughly 30%.
Dataset
Multi-teacher synthetic data curated across eight capability categories: coding, debugging, code review, multi-turn conversation, self-correction, adversarial disambiguation, math reasoning, and planning. Every example passes automated quality gating before inclusion — consistency checks, format validation, and content review.
Evaluation
Two complementary signals. Training-time held-out loss on a 58-record validation set gives the rigorous lift number. The Round 1 benchmark suite — 20 hand-crafted objective tasks with programmatic verifiers — gives the absolute capability score of the quantized local model.
Reproducibility
Task definitions, runner script, and raw results JSON all shipped with the model. Any future checkpoint can be re-scored against the same suite by pointing the runner at a different model endpoint. No proprietary evaluation harness.
What Phase 0 actually proves
- The pipeline works end-to-end. Data curation, BF16 LoRA training on H200, adapter merge into base weights, conversion to GGUF, 4-bit quantization, local serving, programmatic benchmarking — every stage completed successfully. That was the real goal.
- The training-time signal is real. On in-distribution held-out data, perplexity dropped 12.9%. The model demonstrably learned from the 525 training examples — just not enough to shift general-purpose capability on out-of-distribution benchmarks.
- Parity on OOD benchmarks is the expected outcome. Published SFT studies consistently show that sub-1,000 example, single-epoch fine-tunes move in-distribution metrics without meaningfully changing held-out task performance. This result matches that literature.
- No benchmark contamination concerns here. The tuned model did not mysteriously outperform base on leaderboard tasks. If there were data leakage, the delta would be positive. It isn't.
- Direct-answer constraint. All benchmarks were run with reasoning disabled (fast-inference mode). Enabling full chain-of-thought would likely improve math and multi-step tasks for both models; the A/B remains valid under identical configuration.
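The perplexity figure above can be translated into a loss difference. This assumes perplexity is the exponential of the mean per-token cross-entropy loss in nats, which is the usual definition; the source does not state which convention it uses, and the absolute loss values are not given.

```latex
\mathrm{PPL} = \exp(\mathcal{L})
\quad\Longrightarrow\quad
\frac{\mathrm{PPL}_{\text{tuned}}}{\mathrm{PPL}_{\text{base}}}
  = \exp\!\left(\mathcal{L}_{\text{tuned}} - \mathcal{L}_{\text{base}}\right)
  = 1 - 0.129 = 0.871
\quad\Longrightarrow\quad
\mathcal{L}_{\text{base}} - \mathcal{L}_{\text{tuned}} = -\ln(0.871) \approx 0.138
```

Under that assumption, a 12.9% perplexity drop corresponds to a held-out loss reduction of roughly 0.138 nats per token.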
What comes after pipeline validation
Founder-led tech, built in the West of Ireland
ByteMe is a tech-services company based in the West of Ireland, building the outsourced tech department that small and mid-sized businesses need but can't justify hiring in-house. We do the web work, the hosting, the server management, the integrations — and the things most agencies can't, like training and hosting custom AI models on your data.
The same engineering discipline used to build and ship a 35-billion-parameter language model goes into the websites and infrastructure we build for clients. Boring problems and frontier problems share the same root: do the work properly the first time.
Founder-led
You talk to the person doing the work. No account managers, no offshore handoffs, no ticket queues. The same engineer who scopes the project writes the code and answers the phone.
AI-capable
Most agencies sell "AI" by reselling someone else's API. We can do that — but we can also train and host custom models on your own data, on your own infrastructure. The capability is rare; we have it.
Outcome-focused
Fixed-fee work where we can scope it. Clear milestones, weekly progress, no scope creep without a conversation. You buy the result, not the hours.
Local & accountable
Irish company, Irish founder, GDPR-friendly by default. We can sit in your office if you want. When something breaks at 6pm on a Friday, we're not in a different time zone hoping you don't notice.
Start with a free 30-min call
No obligation. You tell us what's broken or what you're trying to build. We tell you whether we can help, and what it might cost. If we're not the right fit, we'll say so — and where possible, point you at someone who is.
Already know what you want? The Tech & AI Consultation is a 90-minute deep-dive that ends with a written 30 / 60 / 90-day roadmap — yours to keep, credited toward any project that follows.
dominic@byteme.ie