Outsourced Tech Department · AI-Capable

The tech department your business doesn't have — and the AI most agencies can't build

Web, hosting, infrastructure, and custom AI integration for Irish SMEs. We do the work an in-house tech team would do, without the headcount. And we have a rare capability most agencies can't match: training and hosting your own AI models on your own data.

Anything tech your business struggles with

Most Irish SMEs are too small for an in-house tech team but too dependent on tech to ignore the problems. We're the outsourced version — modular if you need one thing, full-stack if you need everything. Below is what we do today; we expand into adjacent areas as we grow.

Web Creation

Marketing sites, brochure sites, simple CMS, contact + analytics — built fast, designed to convert. Modern stack (Next.js / static), mobile-friendly by default, your content under your control.

Hosting & Maintenance

We host and maintain your site so you don't have to. Backups, SSL, security patching, content updates, uptime monitoring, monthly reporting. Predictable monthly fee.

Server & Cloud Infrastructure

Server provisioning, cloud migration, email + DNS + SSL automation, security hardening, monitoring, on-call patching. The plumbing your business runs on, kept boring on purpose.

Custom Software

Bespoke internal tools, dashboards, integrations between systems that don't talk to each other, automation of work that shouldn't be manual. If a person types it every day, we can probably eliminate it.

AI & Data

Custom AI integration where it actually saves time — document processing, internal Q&A on your own knowledge base, workflow automation. Hosted on your infra, on your data — not on someone else's API. Proof we can build this →

Outsourced Tech Department

All of the above, retained relationship, one point of contact for any tech need. You stop juggling vendors. We handle whatever lands. Monthly retainer, scoped to your business.

A ladder, not a contract trap

Every engagement starts the same way: a free 30-minute call. From there, you decide how far you want to go. No long contracts to sign just to get a conversation.

Step 01

Free discovery call

30 minutes, no charge, no commitment. You tell us what's broken or what you're trying to build. We tell you whether we can help, and what it might look like.

Step 02

Tech & AI Consultation

A 90-minute deep-dive. We audit what you have, identify the highest-leverage opportunities, and leave you with a written 30 / 60 / 90-day roadmap — yours to keep, even if you don't engage further.

Step 03

Project

A scoped engagement to deliver one or two of the things from the roadmap. Fixed-fee where possible; clear milestones; weekly progress updates. The consultation cost comes off the project.

Step 04

Outsourced Tech Dept

Once we've delivered, the natural step is a retainer. We become the team you call for anything tech. Predictable monthly fee, single point of contact, scoped to your business.

0
Cost of the first conversation. We'd rather know fast whether we're a fit than sell you anything you don't need.
30 / 60 / 90
Day roadmap delivered with every consultation. Yours to keep — implement with us, with someone else, or in-house.
1
Point of contact across every engagement. No ticket queue, no account-manager handoffs — you talk to the person doing the work.

When we say "AI-capable," we mean it

Most agencies sell "AI integration" by reselling someone else's API. We can do that, but we can also build, train, and host the model itself. Below: real measured gains from a 35-billion-parameter language model we fine-tuned ourselves. Full technical detail follows further down the page.

+21.2 pp
Combined pass-rate lift across 600 held-out tasks (39.0% → 60.2%). Wilson 95% confidence intervals do not overlap — statistically distinguishable.
+45 pp
Math benchmark (45% → 90%). The pass rate doubled on grade-school multi-step reasoning.
+20.6 pp
Code-generation benchmark (27% → 47.7%). Real lift on out-of-distribution code synthesis.

600 tasks. +21.2 pp lift. Statistically distinguishable.

Same base model and pipeline as Phase 0, but trained on the full 40,100-record S+ curated corpus for two epochs. This is a capability-gain campaign: a full A/B head-to-head against the unmodified base model on 600 tasks across six suites (1,200 task runs in total across both models). Wilson 95% confidence intervals do not overlap.

Nexis model: qwen3_5-35b-a3b-nexis-phase05
Base: Qwen 3.5 35B-A3B · MoE + GDN
Method: BF16 LoRA · rank 32 · 2 epochs
Training data: 40,100 S+ curated records · 11 public sources · Apache/MIT/CC
Hardware: Modal H200 141 GB
Training time: 2 h 41 m · 230 steps
Eval quantization: GGUF Q4_K_M · 19.7 GB
Date: 2026-04-18
+21.2 pp
Combined pass-rate lift on 600 held-out tasks: 39.0% → 60.2%. Wilson 95% CIs [35.2, 43.0] vs [56.2, 64.0] — do not overlap, statistically distinguishable
90.0%
GSM8K math pass rate, up from 45.0% on base: a doubling on a held-out math benchmark. Consistent with what published SFT studies report once training data grows well past the sub-1,000-example regime
47.7%
HumanEval code-generation pass rate, up from 27.1% on base — real code-synthesis lift on the mainstream Python benchmark, not just in-distribution fit

Base vs Nexis Phase 0.5 — head-to-head on 600 held-out tasks

Both models served from the Modal volume at the same quantization (GGUF Q4_K_M, llama.cpp CUDA, all layers on H200) with identical sampling config. Every task programmatically verified — code execution, numeric tolerance, regex, format rules. These are the raw numbers. The base model runs without a system prompt; the tuned model runs with the Nexis system prompt that matches production serving.

| Suite | What it measures | Base | Nexis P0.5 | Δ |
|---|---|---|---|---|
| Objective suite (hand-crafted, 20 tasks) | Coding, debugging, math, reasoning, adversarial disambiguation, instruction-following. Already saturated on base at 85%; no headroom for lift. | 17 / 20 (85.0%) | 17 / 20 (85.0%) | 0 pp |
| HumanEval-full (155 tasks, HumanEval suite) | Python function synthesis from docstring; code must pass all hidden test cases. Mainstream code-generation benchmark. | 42 / 155 (27.1%) | 74 / 155 (47.7%) | +20.6 pp |
| GSM8K-full (200 tasks, GSM8K suite) | Grade-school math word problems; multi-step arithmetic reasoning with numeric-tolerance verification. Mainstream math benchmark. | 90 / 200 (45.0%) | 180 / 200 (90.0%) | +45.0 pp |
| IFEval-lite (15 tasks, IFEval-style) | Strict instruction-compliance tasks with programmatic verifiers (word count, format, JSON, forbidden characters). | 5 / 15 (33.3%) | 6 / 15 (40.0%) | +6.7 pp |
| MBPP-full (200 tasks, MBPP suite) | Mostly-basic Python problems; code synthesis from natural-language spec with unit-test verification. | 79 / 200 (39.5%) | 77 / 200 (38.5%) | −1.0 pp |
| Identity (10 probes, self-identification) | "Who are you?" style probes that verify the Nexis identity stuck under the system prompt. Expected to move, and it did. | 1 / 10 (10.0%) | 7 / 10 (70.0%) | +60.0 pp |
| Combined | All 600 held-out tasks, one pass per task, direct-answer mode. 95% Wilson CIs do not overlap. | 234 / 600 (39.0%) | 361 / 600 (60.2%) | +21.2 pp |

The headline +21.2 pp isn't concentrated in one suite — GSM8K moves +45, HumanEval moves +20.6, Identity moves +60, IFEval moves +6.7. MBPP drifts −1 (within noise). Objective is already saturated at 85%. The lift pattern matches what one would expect from 40,100 well-curated examples targeting code, math, and reasoning — not a single-task artifact.
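The combined row can be re-derived from the per-suite counts in the table. A minimal Python sketch; the `SUITES` dictionary and `pass_rate` helper are illustrative names, not part of the published runner:

```python
# Pass counts copied from the table above: suite -> (tasks, base passes, tuned passes).
SUITES = {
    "objective": (20, 17, 17),
    "humaneval_full": (155, 42, 74),
    "gsm8k_full": (200, 90, 180),
    "ifeval": (15, 5, 6),
    "mbpp_full": (200, 79, 77),
    "identity": (10, 1, 7),
}


def pass_rate(passed, total):
    """Pass rate as a percentage."""
    return 100.0 * passed / total


total_tasks = sum(n for n, _, _ in SUITES.values())
base_passed = sum(b for _, b, _ in SUITES.values())
tuned_passed = sum(t for _, _, t in SUITES.values())

# Per-suite deltas in percentage points.
for name, (n, base, tuned) in SUITES.items():
    delta = pass_rate(tuned, n) - pass_rate(base, n)
    print(f"{name:15s} {pass_rate(base, n):5.1f}% -> {pass_rate(tuned, n):5.1f}%  ({delta:+.1f} pp)")

combined_delta = pass_rate(tuned_passed, total_tasks) - pass_rate(base_passed, total_tasks)
print(f"combined: {base_passed}/{total_tasks} -> {tuned_passed}/{total_tasks} ({combined_delta:+.1f} pp)")
```

Summing raw pass counts, rather than averaging the per-suite percentages, weights each task equally, which is how the Combined row is reported.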

Sample outputs

Three representative qualitative responses from the Phase 0.5 evaluation run, captured in full-thinking mode and reproduced unedited.

Code review · bug identification
Prompt
Review this Python function and point out the single most important bug. Be concise.

def binary_search(arr, target):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid
        else:
            hi = mid
    return -1
Model response

The most important bug is in the update of lo and hi. When arr[mid] < target, lo should be updated to mid + 1, not mid. Similarly, when arr[mid] > target, hi should be updated to mid - 1, not mid. The current implementation can lead to an infinite loop if the target is not found in the array.
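For reference, a minimal corrected sketch of the function from the prompt. It assumes the original half-open search range `[lo, hi)` is kept; under that convention only the `lo` update needs to change, because `hi = mid` already excludes `mid`. Switching to `hi = mid - 1`, as the response suggests, would additionally require `hi = len(arr) - 1` and a `lo <= hi` loop condition.

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    lo, hi = 0, len(arr)          # search the half-open range [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1          # the fix: skip past mid so the range always shrinks
        else:
            hi = mid              # already correct for a half-open range
    return -1
```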

Multi-step reasoning · math / break-even
Prompt
A company makes widgets. Fixed costs are $10,000 per month. Variable cost per widget is $3. They sell widgets for $8 each. At what monthly volume do they break even? Show your reasoning then give the final number.
Model response

To break even, the company needs to cover its fixed costs and variable costs with its revenue. Let's denote the number of widgets sold as x.

Total cost: TC = FC + VC = $10,000 + ($3 × x)
Revenue: R = P × x = $8 × x

At break-even, R = TC:
$8x = $10,000 + $3x
$5x = $10,000
x = 2,000

Final answer: 2,000 widgets per month.

Adversarial disambiguation · root-cause ranking
Prompt
A user asks: 'My Python script that worked yesterday now crashes with ImportError: No module named numpy. Nothing changed. Why?' — identify the most likely explanations and how to diagnose, in order of likelihood.
Model response

The most likely explanations in order:

  1. Python environment changed. Different virtual environment, different conda env, different Python version, or IDE swapped interpreter. Diagnose with python --version, which python, and echo $VIRTUAL_ENV.
  2. numpy not installed in the current environment. Check with pip show numpy or python -m pip show numpy. Install with pip install numpy if missing.
  3. Python path issues. sys.path may have changed. Print it with python -c "import sys; print(sys.path)".
  4. Corrupted installation. Reinstall: pip uninstall numpy && pip install numpy.
  5. Shadowed module. A local file named numpy.py can shadow the import.

Quick diagnostic sequence: python --version → pip show numpy → sys.path → reinstall if needed.
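The same checks can be gathered from inside Python. This is a small sketch, not part of the model's response; `diagnose` is a hypothetical helper name, and it relies only on the standard library:

```python
import importlib.util
import os
import sys


def diagnose(module_name):
    """Collect the facts that usually explain a sudden ImportError."""
    spec = importlib.util.find_spec(module_name)  # None if the module cannot be located
    return {
        "interpreter": sys.executable,               # which Python is actually running
        "python_version": sys.version.split()[0],
        "virtual_env": os.environ.get("VIRTUAL_ENV"),  # None outside an activated venv
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in diagnose("numpy").items():
        print(f"{key}: {value}")
```

If `location` points inside the project directory rather than the environment's site-packages, a local file is probably shadowing the real package.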

Method

Training

BF16 LoRA (rank 32, alpha 32, dropout 0) for two effective epochs on 40,100 curated records. Per-device batch 4 × gradient accumulation 8 = effective batch 32. Cosine LR schedule peaked at 1e-4 with 3% warmup, paged AdamW 8-bit optimizer. Packing at 4,096 tokens. Warm-started from the Phase 0.5 smoke-run adapter; total 230 steps on Modal H200 141 GB.
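The batch and schedule arithmetic above can be sketched in a few lines of Python. The numbers come from the paragraph; the exact schedule shape (linear warmup, cosine decay to zero, warmup steps rounded up) is an assumption, since only "cosine with 3% warmup" is stated.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM = 8
TOTAL_STEPS = 230
PEAK_LR = 1e-4
WARMUP_RATIO = 0.03

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM         # sequences per optimizer step
warmup_steps = math.ceil(WARMUP_RATIO * TOTAL_STEPS)    # assumption: rounded up
packed_sequences_seen = effective_batch * TOTAL_STEPS   # packed 4,096-token sequences over the run


def lr_at(step):
    """Learning rate at a 0-indexed optimizer step (assumed linear warmup + cosine decay)."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))


print(effective_batch, warmup_steps, packed_sequences_seen)
print([f"{lr_at(s):.2e}" for s in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS - 1)])
```

Because records are packed into 4,096-token sequences, the step count depends on packed sequences rather than on the raw 40,100 records, which is why two epochs fit in 230 steps.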

Dataset

40,100 records from eleven public, human-written, commercially-licensed sources: OASST2, CommitPackFT, TACO, CodeContests, GSM8K, MATH, NaturalProofs, OpenBookQA, Dolly 15k, Stack Exchange, NuminaMath-1.5. Filtered through an eight-stage pipeline with Aho-Corasick contamination checks against every held-out benchmark test set. Zero synthetic or AI-generated data.
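To illustrate the contamination-check idea, here is a pure-Python sketch of Aho-Corasick multi-pattern matching. It is not the pipeline's own code: the `normalize` step, the `is_contaminated` helper, and the two benchmark snippets are made-up placeholders, and the real pattern granularity is not described here.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton: finds all patterns in one pass over the text."""

    def __init__(self, patterns):
        self.goto = [{}]   # per-node character transitions
        self.fail = [0]    # failure links
        self.out = [[]]    # patterns that end at each node
        for pattern in patterns:
            if pattern:
                self._add(pattern)
        self._build()

    def _add(self, pattern):
        node = 0
        for ch in pattern:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = nxt
            node = nxt
        self.out[node].append(pattern)

    def _build(self):
        # Breadth-first, so every failure target is finished before it is used.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        node, hits = 0, set()
        for ch in text:
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            hits.update(self.out[node])
        return hits


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not hide a match."""
    return " ".join(text.lower().split())


# Placeholder snippets standing in for benchmark test-set text.
benchmark_snippets = [
    "a farmer has 17 sheep and sells 9 of them",
    "def has_close_elements(numbers, threshold)",
]
automaton = AhoCorasick(normalize(s) for s in benchmark_snippets)


def is_contaminated(record):
    return bool(automaton.search(normalize(record)))
```

The appeal of this approach is that every candidate record is scanned once, regardless of how many benchmark snippets are loaded into the automaton.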

Evaluation

600 programmatically-verified tasks per phase across six suites (objective, humaneval_full, gsm8k_full, ifeval, mbpp_full, identity) plus five qualitative probes. Base and tuned run under identical sampling config (temp 0.2, top-p 0.9). Wilson score 95% intervals computed for statistical significance; non-overlapping intervals are a positive lift signal at n=600.
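For readers who want to check the intervals, the Wilson score interval can be computed directly from the pass counts. A minimal sketch using the standard formula with z = 1.96; the function name is ours:

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a binomial pass rate (z=1.96 for roughly 95%)."""
    p_hat = passed / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


base_lo, base_hi = wilson_interval(234, 600)
tuned_lo, tuned_hi = wilson_interval(361, 600)

print(f"base  [{100 * base_lo:.1f}, {100 * base_hi:.1f}]")    # base  [35.2, 43.0]
print(f"tuned [{100 * tuned_lo:.1f}, {100 * tuned_hi:.1f}]")  # tuned [56.2, 64.0]
print("intervals overlap:", tuned_lo <= base_hi)              # intervals overlap: False
```

Non-overlap of two 95% intervals is a conservative check: it implies a significant difference, but intervals can overlap slightly and the difference still be significant.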

Reproducibility

The tuned adapter ships to HuggingFace (tlxplays/nexis-35b-a3b-checkpoints). Task definitions, runner script, verifiers, and raw result JSONs are all in the repo. The entire Modal pipeline — training, merge, quantize, bench, A/B report generation — is scripted end-to-end and re-runnable against any future checkpoint.
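As an illustration of what programmatic verifiers of the kinds named above (numeric tolerance, format rules) can look like, here is a hedged sketch. These are not the repo's verifiers; the function names, the "last number in the output" convention, and the default tolerance are all assumptions.

```python
import json
import re


def verify_numeric(output, expected, tol=1e-6):
    """Pass if the last number in the output is within tol of the expected value."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return bool(numbers) and abs(float(numbers[-1]) - expected) <= tol


def verify_word_count(output, max_words):
    """Pass if the output stays within a word budget."""
    return len(output.split()) <= max_words


def verify_json_keys(output, required_keys):
    """Pass if the output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys).issubset(data)


print(verify_numeric("Final answer: 2,000 widgets per month.", 2000))
print(verify_word_count("Short and direct.", max_words=5))
print(verify_json_keys('{"name": "Nexis", "ok": true}', ["name", "ok"]))
```

Verifiers like these return a plain pass or fail per task, which is what makes the pass counts in the tables reproducible without human grading.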

What Phase 0.5 actually proves

  • The training signal moves held-out capability. Unlike Phase 0 (525 examples, ~parity on OOD benchmarks), Phase 0.5 moves GSM8K by +45 pp, HumanEval by +20.6 pp, and the combined pass rate by +21.2 pp — all on held-out benchmarks the training data does not contain. That is a capability lift, not a distribution-fit artifact.
  • Wilson 95% CIs do not overlap. [35.2, 43.0] for base vs [56.2, 64.0] for tuned, on n=600. Non-overlapping 95% intervals imply the difference is statistically distinguishable at p<0.05, so it is very unlikely to be a chance observation.
  • The S+ corpus curation protocol is validated. 40,100 records, adversarial-hardened across three review rounds, zero synthetic data, Aho-Corasick decontamination against eight benchmark test sets. No leakage evidence: if the model had seen test answers, the delta would come from lexical overlap — but the structure of the gains (math and code generation, not exact-match keywords) is consistent with genuine capability improvement.
  • MBPP −1 pp is within noise. On a 200-task benchmark, a 1-percentage-point swing is two tasks (79 vs 77 passes), well inside run-to-run variance for this class of model. No regression worth reporting.
  • Quantization gap was controlled. Both base and tuned were benched at GGUF Q4_K_M after identical quantize steps. The comparison is apples-to-apples at the deployment precision.
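The "distinguishable" and "within noise" claims above can be sanity-checked with a two-proportion z-test on the published pass counts. This sketch treats the two runs as independent samples, which is an assumption: both models saw the same tasks, so a paired test such as McNemar's would be more appropriate, but it needs per-task results that are not reproduced on this page.

```python
import math


def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """z statistic and two-sided p-value for a difference in pass rates (pooled variance)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


z_combined, p_combined = two_proportion_z(234, 600, 361, 600)  # base vs tuned, all tasks
z_mbpp, p_mbpp = two_proportion_z(79, 200, 77, 200)            # base vs tuned, MBPP only

print(f"combined: z = {z_combined:.2f}, p = {p_combined:.2g}")
print(f"MBPP:     z = {z_mbpp:.2f}, p = {p_mbpp:.2g}")
```

Under these assumptions the combined difference sits many standard errors from zero, while the MBPP difference is a small fraction of one standard error.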

What comes after Phase 0.5

Phase 1

Qwen 3.5 397B-A17B QLoRA

Same pipeline, same data protocol, scaled to the 397B-parameter mixture-of-experts model. Infrastructure procurement pending; target configuration 8× A100 80 GB or equivalent H100/H200 cluster. We expect the Phase 0.5 lift to carry over at the larger parameter count, though that is untested; the goal is frontier-competitive capability on open weights.

Soon

Preference optimization (DPO/ORPO)

Preference pair generation infrastructure is the next bottleneck. The S+ corpus gives us the SFT foundation; a DPO round on top should move the capability envelope further without retraining from scratch.

A small training run, documented transparently

Phase 0 was not a capability-gain campaign: 525 curated examples, 1 epoch of LoRA fine-tuning, deliberately small. The goal was to validate the end-to-end pipeline: data curation → training → merge → quantize → serve → benchmark. What follows is the full record of that validation, including a head-to-head against the unmodified base model.

Nexis model: qwen3_5-35b-a3b-nexis
Base: Qwen 3.5 35B-A3B · MoE + GDN
Method: BF16 LoRA · 1 epoch
Training data: 525 curated records
Hardware: Lightning AI H200 141 GB
Training time: 18 minutes
Eval quantization: GGUF Q4_K_M · 20 GB
Date: 2026-04-15
−12.9%
Perplexity on in-distribution held-out set (2.94 → 2.57) — the model fits its training distribution better, as expected
60
Held-out benchmark tasks run on both base and tuned, across four suites: HumanEval-lite, GSM8K-lite, IFEval-lite, and a hand-crafted 20-task capability suite
~parity
Capability on out-of-distribution benchmarks. Tuned 70%, base 72%. Exactly what the literature predicts for 525 examples × 1 epoch of SFT

Base vs Nexis — head-to-head on 60 held-out tasks

Both models were served locally at the same quantization (GGUF Q4_K_M, Vulkan, 12-layer GPU offload), with identical sampling config and prompt format. Every task is verified programmatically — code execution, numeric tolerance, or rule-based format checks. These are the raw numbers.

| Suite | What it measures | Base | Nexis R1 | Δ |
|---|---|---|---|---|
| Objective suite (hand-crafted, 20 tasks) | Coding, debugging, math, reasoning, adversarial disambiguation, instruction-following. Tests the categories the training data targeted. | 18 / 20 (90%) | 17 / 20 (85%) | −5 pp |
| HumanEval-lite (10 tasks from the HumanEval suite) | Python function synthesis from docstring. Model must generate code that passes all hidden test cases. Mainstream code-generation benchmark. | 10 / 10 (100%) | 10 / 10 (100%) | 0 pp |
| GSM8K-lite (15 tasks from GSM8K) | Grade-school math word problems. Multi-step arithmetic reasoning. Mainstream math benchmark. | 4 / 15 (27%) | 4 / 15 (27%) | 0 pp |
| IFEval-lite (15 tasks, IFEval-style) | Strict instruction-compliance tasks with programmatic verifiers (word count, format, JSON structure, forbidden characters). The benchmark most aligned with the training data. | 11 / 15 (73%) | 11 / 15 (73%) | 0 pp |
| Combined | All 60 held-out tasks, one pass per task, direct-answer mode. | 43 / 60 (71.7%) | 42 / 60 (70.0%) | −1.7 pp |

The −1.7 pp combined delta comes from one single task — a word-count function where the tuned model produced a semantically correct Python dict in a different insertion order than the test expected. Remove that one edge case and the two models are indistinguishable on every single one of the remaining 59 tasks. For Phase 0 at this scale, that is exactly the expected outcome: in-distribution fit moves, out-of-distribution capability does not.
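The word-count edge case is easy to reproduce. The verifier's actual comparison is not shown here, so this sketch assumes it compared a serialized form of the dict; the values are illustrative. Python dicts compare equal regardless of insertion order, but their string and item-list forms do not.

```python
import json

expected = {"the": 2, "cat": 1}   # order the test expected
produced = {"cat": 1, "the": 2}   # same counts, different insertion order

# Semantic comparison: dict equality ignores insertion order.
print(expected == produced)                                # True

# Order-sensitive comparisons that would fail the task.
print(str(expected) == str(produced))                      # False
print(list(expected.items()) == list(produced.items()))    # False


def same_counts(a, b):
    """Order-insensitive check on a canonical serialized form."""
    return json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)


print(same_counts(expected, produced))                     # True
```

A verifier that compares parsed values, or sorts keys before serializing, would score this task as a pass for both models.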

Sample outputs

Three representative responses from the full-reasoning evaluation run, reproduced unedited.

Multi-step reasoning · math / break-even
Prompt
A company makes widgets. Fixed costs are $10,000 per month. Variable cost per widget is $3. They sell widgets for $8 each. At what monthly volume do they break even? Show your reasoning then give the final number.
Model response

To find the break-even volume, I'll use the break-even formula where total revenue equals total costs.

Given:
Fixed costs (FC) = $10,000 per month
Variable cost per widget (VC) = $3
Selling price per widget (P) = $8

Break-even formula: At break-even, Total Revenue = Total Costs, so P × Q = FC + (VC × Q).

Solving:
8Q = 10,000 + 3Q
5Q = 10,000
Q = 2,000

Verification: Revenue at 2,000 widgets: 2,000 × $8 = $16,000. Total costs: $10,000 + (2,000 × $3) = $16,000. Revenue equals costs. ✓

Final answer: 2,000 widgets per month.

Architecture review · pragmatic judgment
Prompt
Briefly compare: using a single relational database (Postgres) vs. a CQRS pattern with separate read/write stores, for a SaaS product with ~100k users and ~10 writes/sec, ~200 reads/sec. What's your recommendation and why?
Model response

Recommendation: Start with a single Postgres database.

Why:

  • Current load is trivial. 10 writes/sec and 200 reads/sec is well within Postgres's capabilities — Postgres routinely handles 10k+ writes/sec on modest hardware.
  • CQRS adds significant complexity. Two data stores, eventual consistency handling, synchronization logic, and operational overhead of two databases.
  • Most SaaS apps don't need CQRS. The pattern is most valuable when read and write workloads have fundamentally different characteristics. A 20:1 read-to-write ratio isn't extreme.
  • Complexity cost. CQRS introduces eventual-consistency bugs, synchronization failures, harder debugging. For a typical SaaS app, that complexity often outweighs the performance benefits.

When to consider CQRS: read/write workloads have different scaling requirements; you need read models optimized for specific query patterns; high-throughput systems (10k+ writes/sec); separating domain logic from reporting/analytics.

Bottom line: Optimize for simplicity first. Postgres will handle your current load comfortably. Add CQRS only when you have a specific, measurable bottleneck that justifies the complexity.

Calibrated uncertainty · risk-aware response
Prompt
Is it safe to run rm -rf /tmp/* on a Linux server as root? Answer with your confidence level and reasoning.
Model response

Confidence Level: 95%

Safe aspects:

  • /tmp is specifically designed for temporary files.
  • Files in /tmp should be owned by the user who created them.
  • Most applications create temp files in /tmp with predictable naming.
  • The directory itself should not be deleted, only its contents.

Risks to consider:

  • Running applications. If an application has a file open in /tmp, the delete will succeed but the file handle remains valid. The application may crash or misbehave.
  • Special directories. Some systems use /tmp for database temporary files, build artifacts, application-specific runtime data.
  • Symlinks. If /tmp contains symlinks pointing outside, rm -rf will follow them.
  • Race conditions. Between listing and deleting, new files may be created.

Safer alternatives: find /tmp -mindepth 1 -delete or stage into a mktemp -d directory first.

Conclusion: Generally safe on a server where you control the workload, but verify no critical services are actively using /tmp first. Always test on non-production first.

Method

Training

BF16 LoRA (rank 16, alpha 32) for one epoch on 525 curated records. Unquantized BF16 LoRA instead of 4-bit QLoRA: the H200's 141 GB VRAM provided enough headroom to skip dequantization entirely, cutting per-step latency by roughly 30%.

Dataset

Multi-teacher synthetic data curated across eight capability categories: coding, debugging, code review, multi-turn conversation, self-correction, adversarial disambiguation, math reasoning, and planning. Every example passes automated quality gating before inclusion — consistency checks, format validation, and content review.

Evaluation

Two complementary signals. Training-time held-out loss on a 58-record validation set gives the rigorous lift number. The Round 1 benchmark suite — 20 hand-crafted objective tasks with programmatic verifiers — gives the absolute capability score of the quantized local model.
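For context on the held-out loss signal: perplexity is the exponential of the mean per-token negative log-likelihood, and the lift is its relative change. A minimal sketch using the rounded perplexities quoted on this page (2.94 and 2.57); the helper names are ours. Note that these rounded endpoints give roughly −12.6%, so the −12.9% headline was presumably computed from unrounded values.

```python
import math


def perplexity(mean_nll):
    """Perplexity from mean per-token negative log-likelihood (natural log)."""
    return math.exp(mean_nll)


def relative_change(before, after):
    """Signed relative change, e.g. -0.126 for a 12.6% drop."""
    return (after - before) / before


base_ppl, tuned_ppl = 2.94, 2.57   # rounded values as quoted on this page
print(f"implied losses: {math.log(base_ppl):.3f} -> {math.log(tuned_ppl):.3f}")
print(f"relative change: {100 * relative_change(base_ppl, tuned_ppl):.1f}%")
```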

Reproducibility

Task definitions, runner script, and raw results JSON all shipped with the model. Any future checkpoint can be re-scored against the same suite by pointing the runner at a different model endpoint. No proprietary evaluation harness.

What Phase 0 actually proves

  • The pipeline works end-to-end. Data curation, BF16 LoRA training on H200, adapter merge into base weights, conversion to GGUF, 4-bit quantization, local serving, programmatic benchmarking — every stage completed successfully. That was the real goal.
  • The training-time signal is real. On in-distribution held-out data, perplexity dropped 12.9%. The model demonstrably learned from the 525 training examples — just not enough to shift general-purpose capability on out-of-distribution benchmarks.
  • Parity on OOD benchmarks is the expected outcome. Published SFT studies consistently show that sub-1,000 example, single-epoch fine-tunes move in-distribution metrics without meaningfully changing held-out task performance. This result matches that literature.
  • No benchmark contamination concerns here. The tuned model did not mysteriously outperform base on leaderboard tasks. If there were data leakage, the delta would be positive. It isn't.
  • Direct-answer constraint. All benchmarks were run with reasoning disabled (fast-inference mode). Enabling full chain-of-thought would likely improve math and multi-step tasks for both models; the A/B remains valid under identical configuration.

What comes after pipeline validation

Round 2

ORPO with 2–4k preference pairs

Same base model, same pipeline, but with preference optimization folded into the training loss. Roughly 4–8× the Phase 0 data (2–4k pairs vs 525 examples) and a technique more likely to move out-of-distribution capability. Single H100, ~20 hours wall time.

Later

Production training at 397B parameters

Same pipeline scaled to a 397B-parameter mixture-of-experts model. Apache 2.0 licensed, frontier-competitive on open weights. Demonstrates capability we can put behind a client engagement when the use case justifies it.

Next.js · Cloudflare · AWS · GCP · Azure · Docker · Linux · Postgres · Python · TypeScript · vLLM · PyTorch · HuggingFace · LoRA / DPO · Mixture-of-Experts · FP8 Inference

Founder-led tech, built in the West of Ireland

ByteMe is a tech-services company based in the West of Ireland, building the outsourced tech department that small and mid-sized businesses need but can't justify hiring in-house. We do the web work, the hosting, the server management, the integrations — and the things most agencies can't, like training and hosting custom AI models on your data.

The same engineering discipline used to build and ship a 35-billion-parameter language model goes into the websites and infrastructure we build for clients. Boring problems and frontier problems share the same root: do the work properly the first time.

Dominic Lynch
Founder · Galway

Founder-led

You talk to the person doing the work. No account managers, no offshore handoffs, no ticket queues. The same engineer who scopes the project writes the code and answers the phone.

AI-capable

Most agencies sell "AI" by reselling someone else's API. We can do that — but we can also train and host custom models on your own data, on your own infrastructure. The capability is rare; we have it.

Outcome-focused

Fixed-fee work where we can scope it. Clear milestones, weekly progress, no scope creep without a conversation. You buy the result, not the hours.

Local & accountable

Irish company, Irish founder, GDPR-friendly by default. We can sit in your office if you want. When something breaks at 6pm on a Friday, we're not in a different time zone hoping you don't notice.

Start with a free 30-min call

No obligation. You tell us what's broken or what you're trying to build. We tell you whether we can help, and what it might cost. If we're not the right fit, we'll say so — and where possible, point you at someone who is.

Already know what you want? The Tech & AI Consultation is a 90-minute deep-dive that ends with a written 30 / 60 / 90-day roadmap — yours to keep, credited toward any project that follows.

dominic@byteme.ie