Anales AFA Vol. 36 Nro. 4 (Diciembre 2025 - Marzo 2026) 95 - 105
AUTOMATIC RECALIBRATION OF QUANTUM DEVICES BY REINFORCEMENT
LEARNING
T. Crosta*1, L. Rebón2, F. Vilariño1,3, J. M. Matera4 and M. Bilkis1
1Computer Vision Center (CVC), 08193 Bellaterra (Cerdanyola del Vallès), Spain
2Instituto de Física La Plata (IFLP), CONICET - UNLP, and Departamento de Ciencias Básicas, Facultad de Ingeniería, Universidad
Nacional de La Plata (UNLP), La Plata 1900, Argentina
3Department of Computer Science, Universitat Autònoma de Barcelona (UAB), 08193 Bellaterra (Cerdanyola del Vallès), Spain.
4IFLP-CONICET, Departamento de Física, Facultad de Ciencias Exactas, Universidad Nacional de La Plata, C.C. 67, La Plata 1900,
Argentina
Received: 03/05/2025; Accepted: 09/09/2025
During their operation, due to shifts in environmental conditions, devices undergo various forms of detuning from their
optimal settings. Typically, this is addressed through control loops, which monitor relevant variables and the device performance
in order to maintain the settings at their optimal values. Quantum devices are particularly challenging, since their functionality relies
on precisely tuned parameters. At the same time, a detailed modeling of the environmental behavior is often
computationally unaffordable, while a direct measurement of the parameters defining the system state is costly and introduces
extra noise into the mechanism. In this study, we investigate the application of reinforcement-learning techniques to
develop a model-free control loop for the continuous recalibration of quantum device parameters. Furthermore, we explore
the advantages of incorporating minimal environmental-noise models. As an example, we present the application to numerical
simulations of a Kennedy-receiver-based long-distance quantum communication protocol.
Keywords: Quantum Machine Learning, Quantum Control, Automatic re-calibration, Kennedy receiver
https://doi.org/10.31527/analesafa.2025.36.4.95-105 ISSN - 1850-1168 (online)
*tomycrosta@gmail.com
I. INTRODUCTION
Calibrating an experimental apparatus is a primitive and ubiquitous task in most areas of science and technology. In
turn, sensors and detectors constitute the way we extract information about the environment surrounding us and
better understand reality via further post-processing of the acquired data. Thus, fully calibrating experimental devices is
a primordial task and, in turn, an active research topic [1-11]. In this manuscript, we study the recurrent calibration of
devices whose deployment environment is challenging to model. Examples of this are scenarios that vary heavily
with time in a way that is hard to predict, e.g. turbulent atmosphere [12-17], hydrological models [18,19] or non-isolated
magnetometers [20,21], among many others. For such settings, where state-of-the-art technology is being used to push
forward the boundaries of scientific discovery at a considerable resource overhead, it is of utmost importance to develop
techniques that are ready to adapt the device configuration to the experimental condition at hand. In this regard, a plethora
of artificial-intelligence techniques have recently been developed in the context of sensor calibration [1,2,8,22-29],
change-point detection [30-32] and malfunctioning-device identification [33-38]. Our main contribution is to provide a
framework for re-calibrating quantum devices. Based on this, we present the method applied to quantum-classical long-distance
communication by laser pulses, e.g. satellite-ground or optical-fiber communication. Overall, the success of most
machine-learning (re)calibration schemes considered in the literature relies either on perfect knowledge of the device's functioning
condition, on access to huge amounts of data for training purposes, or on limiting the dynamics of the system. Such assumptions
constitute a double-edged sword when deploying the device under (potentially adversarial) experimental conditions: while
correct configurations can be guaranteed if the machine-learning model was trained on data resembling the experimental
conditions, there is a high probability of remaining off-calibrated otherwise.
Here, we depart from such a notion of similarity between training and deployment scenarios by considering a hybrid
scheme consisting of a pre-training round complemented with a reinforcement-learning stage. The latter fine-tunes the
configuration, so that the device can be adapted to the specific (and potentially unexplored) experimental conditions at hand;
this is done by modifying the device controls, as shown in Fig. 1.
The success of our method hinges on the capability of devising an approximate model of the setting's dependence on
changes in its surroundings (which we indistinctly call the environment). Such an approximate model is to be thought of
as a simplified description of the environment, e.g. captured by very few variables. While not expected to be fully accurate
(it will generally not retrieve the exact device configuration for each specific experimental condition), it shall be thought of as an ansatz
for control initialization. The accuracy of this initialization relies on the capacity of the approximate model to capture
relevant features of the device behavior given the experimental condition at hand. As a rule of thumb, the more complex
such an ansatz, the more accurate the description is expected to be. However, a trade-off arises: while more complex models
tend to be data-consuming until reaching optimal calibrations, tailoring the model to specific experimental conditions will
inherently induce a bias towards a subset of deployment scenarios. Thus, the goal of the pre-training round is to suggest
control-initialization values by using a small number of quantities that can easily be estimated out of a few experiments. The
control values are then improved by means of a complementary reinforcement-learning method, which adapts the control
values to the specific experimental condition in a model-free way. On top of the calibration mechanism, the value of a
decalibration witness is continuously monitored during deployment, which allows the agent to experimentally detect that
the device has entered an off-calibration stage, and thus re-initiate the calibration process. This work is a step further towards
developing a fully automatic re-calibration of quantum detectors through machine-learning techniques. Importantly, we
remark that neither the framework nor the method is specifically biased towards the quantum realm, and both can potentially be
applied to other control problems beyond the quantum-technology scope.
The manuscript is structured as follows. In Sec. II we present our re-calibration framework and describe our method.
In Sec. III we numerically analyze the performance of our re-calibration method in an emblematic long-distance quantum
communication setting. Conclusions and future work are outlined in Sec. IV.
II. The re-calibration framework
We consider a device whose controls are defined by continuous parameters θ = {θ_1, ..., θ_M}. As shown in Fig. 1, our
setting is a black-box device controlled by different knobs k = 1, ..., M, each associated to a control value θ_k. In the
following, we define several quantities of interest.
FIG. 1: We depict a device that needs to be calibrated. Here, the apparatus is controlled by different knobs defined by values θ =
{θ_1, ..., θ_M}, and the aim is to tune such parameters in a way that the device is configured to optimally operate under experimental
conditions E.
Device configuration. A fixed set of parameter values θ completely defines a device configuration.
Score function. The quality of a device configuration θ is evaluated by a score function S_E(θ). The value of the score
function can be estimated, during a calibration stage, by means of N repeated experiments; each experiment i involves
a quantum measurement and leads to a measurement outcome n_i, whose value is generally of stochastic nature. Here, the
full underlying model needed to describe the outcome probability distributions is denoted by E, and generally involves an
accurate description of the noisy channels present in the setting at hand.
Effective score function. The underlying model E is generally inaccessible to the calibrating agent, and hence shall be assumed
unknown to it. This is motivated by the fact that: (i) time-varying deployment conditions can be fundamentally
hard to model, and (ii) even in the case of having full control of the experimental conditions, quantum channel tomography
comes with a considerable sample overhead, implying that the total number of experiments and parameters required to
reach near-optimal environment modelling (plus device calibration) would grow exponentially or be otherwise constrained
to specific scenarios [39-42]. On the contrary, we do assume that a certain relationship exists between the true score
function S_E(θ) and its effective version S_Ẽ(θ), e.g. an effective model Ẽ used by the agent is indeed able to capture
certain relevant features of the score function. Effective models should be thought of as an enhanced control-initialization
strategy. This notion also applies to the case in which the device enters an off-calibration stage and new control values must
be found.
Reinforcement Learning (RL). The setting described above can be framed in the RL language [10,22,43-46], where
an agent repeatedly interacts with an environment in order to maximize a reward function, over different episodes.
Here, at the i-th episode (experiment), the agent selects parameter values θ, observes measurement outcomes n_i, and finally
post-processes them in order to provide a claim for the underlying task the quantum device is used for. Based on the
accuracy of this final action, the agent is given a reward signal, which it uses to improve its estimate of how valuable the
decisions performed were. In RL, this is captured by the so-called state-action value function Q^π(s, a) [10], standing for
the expected reward when departing from state s and taking action a (i.e. either selecting parameters θ or providing a
claim based on the outcomes acquired [10,43]), and following a decision criterion, or policy, π. For an optimal device
usage, the agent shall choose configurations θ leading to a maximum score S_E(θ). Nonetheless, since E is not available
to the agent, value functions need to be estimated out of several experiment repetitions. Importantly, the agent's strategy
is optimized solely based on the rewards acquired during learning. Here, such rewards are not only a way to estimate
value functions, but also serve as a lighthouse for the agent to navigate the decision landscape, allowing a model-free
calibration of the device. We provide further details on how model-free calibration works in Appendix A.
As an example, we consider a single-control device, whose score function S_E(θ) is schematized in Fig. 2. A model-free
agent would initially set the parameter θ at random and consequently estimate its score function out of repeated
experiments. On the contrary, keeping an effective model Ẽ can readily help the agent to improve such an initialization
strategy. Here, the agent's internal model S_Ẽ(θ) serves as an ansatz for the underlying behavior of the score S_E(θ) with respect to the
control θ. Intuitively, the internal model Ẽ is expected to be easier to estimate out of a few experiment repetitions.
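To fix ideas, the following minimal Python sketch contrasts the two initialization strategies. The bell-shaped landscapes and all numerical values are hypothetical choices of ours, merely standing in for S_E and S_Ẽ; they are not taken from the manuscript.

import numpy as np

# Hypothetical single-control example: the true score S_E(theta) and the
# effective score S_Etilde(theta) are bell-shaped curves whose maxima are
# close but not identical, mimicking Fig. 2 (top).
def score_true(theta):
    return np.exp(-(theta - 1.3) ** 2)

def score_effective(theta):
    return np.exp(-(theta - 1.0) ** 2)

thetas = np.linspace(0.0, 3.5, 200)   # discretized control values
rng = np.random.default_rng(0)

# Model-free initialization: a random control value.
theta_random = rng.choice(thetas)

# Effective-model initialization: the maximizer of the surrogate landscape.
theta_ansatz = thetas[np.argmax(score_effective(thetas))]

print("random start:          S_E =", score_true(theta_random))
print("effective-model start: S_E =", score_true(theta_ansatz))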
RL methods have recently been applied to a wide variety of quantum-technology scenarios, among them calibrating a
quantum communication setting [10,11,47-51], optimizing quantum pulses [22,52-54], quantum gate-circuit layout [55,
56], and even graph-processing applications [57], to name a few. However, little has been investigated regarding the capability of
the learning model to adapt the calibration to changes in the environment E happening while the device is being used, e.g.
in the deployment stage [58,59]. In Fig. 2 (bottom) we exemplify how a change in the environment would affect the score
function, requiring a recalibration. In order to detect the new landscape, we can exploit the fact that the new observations will
differ from the ones predicted by the previous exploration, thus providing indications about changes in the environment
even during off-calibration stages.
Decalibration witness. In order to realize that a change has occurred in the environment, the agent must rely on an experimentally
accessible quantity, which we define as the decalibration witness and denote by W_d. By monitoring the behavior
of W_d over different experiments, the agent can readily detect whether a change-point occurs in the environment and thus
re-start the calibration routine if anomalies are detected. Examples of potential decalibration witnesses are estimates of
outcome probabilities, which the agent can straightforwardly construct from the information acquired during previous
experiments.
Automatic re-calibration. The definitions outlined above set a framework to analyze the recurrent calibration of a device.
We now turn to describe our automatic re-calibration method, which makes use of effective score functions, RL routines
and decalibration witnesses. Here, we picture a scenario where the device is initially calibrated and, while it is being
deployed, enters an off-calibration stage which needs to be compensated. The quality of a given configuration
is measured by a score function, which in turn depends on the current experimental conditions; the maximum of the
latter encodes the solution to the problem for which the device is being used.
For instance, in a communication setting, the device configuration is defined by the encoding-decoding strategy (e.g.
the quantum measurement performed to decode information out of the incoming signal), and the score function is given
by the success probability of the protocol. Alternatively, in variational quantum computing applications [60,61], e.g.
the VQE algorithm [62], the device configuration is defined by the free parameters of the parametrized quantum circuit
and the score function is given by the energy landscape, which needs to be estimated out of several repetitions of an experiment.
FIG. 2: Single-parameter device example. Top panel: the optimal calibration of the score S_E(θ) is shown (dashed-red vertical line), together with the
one suggested by the effective model S_Ẽ(θ) (blue-dashed vertical line); while suboptimal, this value is further fine-tuned by means of a model-free scheme
(see main body). Bottom panel: we show the score functions S_E0 and S_E1 before and after a change-point occurs in the environment. As a
consequence, the device optimally configured under E_0 now needs to be re-calibrated to the new optimal configuration for E_1.
The initially-optimal configuration can be attained by model-free RL schemes [10,52], i.e. trial-and-error
learning mechanisms. In this approach, the score function is typically estimated from the rewards acquired for each
device configuration, e.g. by an empirical estimation of its value functions (see Appendix A). In this work, we depart from
this concept by initializing the value-function estimates to a surrogate quantity, defined by the effective score function
S_Ẽ(θ); such a quantity is to be estimated out of a few experiment repetitions, and serves as an ansatz for the score value
assigned to a given device configuration (see Fig. 1 for a schematic representation).
The usage of effective score values is motivated by the fact that experimental conditions might not dramatically differ
from the ideal case. For example, the VQE energy landscape shall preserve certain similarities between a noiseless scenario
and a noisy one, assuming the noise strength is sufficiently low [63]. From this informed initialization of the value-function
estimates, we then exploit the model-free features of traditional RL algorithms, which allows the agent to fine-tune the
device configuration, adapting it to the specific deployment conditions.
The mechanism described above constitutes the calibration stage, in which the actions performed by the calibrating
agent can be rewarded according to their accuracy/correctness. With the initial calibration task accomplished, the device
is then deployed, e.g. used without the necessity of rewarding the agent. As experiments proceed, it is to be expected that
the device undergoes a decalibration, e.g. the experimental conditions might eventually vary. In order to detect that such a change
has occurred, the agent monitors the decalibration witness W_d (for example, measurement outcome probabilities), which is
fed into the specific change-point detection protocol the agent keeps. Thus, by monitoring W_d, the agent can detect that
the device has entered a decalibration stage, e.g. that the deployment conditions have changed. As a consequence, the new
optimal configuration is a different one, and a re-calibration is carried out. This is done similarly to the initial calibration
stage: the effective-model configuration landscape is estimated out of a few experiments, and the model-free RL algorithm
is then used to adjust the configuration to the new optimal one.
The effective model used by the agent, S_Ẽ(θ), along with the decalibration witness W_d and the RL algorithm (e.g. search
strategy, value functions, reward definitions), define a re-calibration strategy. However, each of the strategy components
requires the agent to pre-set a number of hyperparameters. Among them are the number of experiment repetitions needed to
estimate the effective-model configuration landscape (which we denote as N_eff), the number of experiments needed to fine-tune
the configuration using an RL method, denoted as N_rl, the indecision region within which the values taken by W_d will not
lead the agent to re-activate the calibration routine, and the parameters defining the behavior of the RL routine, whose
nature depends on the particular algorithm used. In order to ease the notation, we collect all such parameters in ξ;
in Algorithm 1 a pseudo-code of our re-calibration method is provided.
Algorithm 1: Automatic re-calibration method.
input : S_Ẽ(θ), W_d, ξ, RL-algorithm
output: θ (optimal configuration)
1  Calibration stage by S_Ẽ(θ)
2  Fine-tuning by RL
3  Deployment stage
4  while W_d retrieves normal do
5      deploy device
6      if W_d retrieves anomaly then
7          return to step 1
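A minimal Python sketch of Algorithm 1 could read as follows; the callables estimate_effective_model, rl_fine_tune, witness_is_anomalous and run_experiment are placeholders of ours for the components described above, not part of the released code [96].

def recalibration_loop(estimate_effective_model, rl_fine_tune,
                       witness_is_anomalous, run_experiment, hyper):
    # Sketch of Algorithm 1: calibrate, deploy, and re-calibrate upon anomalies.
    while True:
        # Steps 1-2: calibration by the effective model, then RL fine-tuning.
        q_init = estimate_effective_model(hyper["N_eff"])
        theta = rl_fine_tune(q_init, hyper["N_rl"])
        # Step 3 onwards: deployment while monitoring the witness W_d.
        outcomes = []
        while True:
            outcomes.append(run_experiment(theta))
            if witness_is_anomalous(outcomes, hyper["delta"]):
                break  # anomaly detected: return to step 1 (re-calibrate)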
In the following, we showcase the re-calibration method introduced above in a canonical example of long-distance classical-quantum communication. We stress that our
method can be applied to a wide variety of scenarios, not necessarily constrained to the quantum-technology realm.
III. Illustrative example and numerical development
As an application example, we consider binary coherent-state discrimination, which is a primitive used in long-distance
classical-quantum communication. The usage of quantum resources is expected to boost long-distance communication
rates [64,65] and provide unconditional security [66]. Optimally performing quantum-state discrimination is of
utmost importance to reach capacity rates [67,68], and the binary coherent-state discrimination problem currently stands
as a canonical problem both from a theoretical point of view [69-72], as well as an experimental one [11-13,15-17,73-78].
The interest in this problem lies in the fact that the optimal quantum measurement to be performed by the receiver can be
implemented sequentially, combining linear optical operations and feedback operations, which constitutes an experimentally
friendly setting.
In this setting, the sender encodes a bit k = 0, 1 in the phase of a quantum coherent state |(−1)^k α⟩, which is sent to the
receiver; e.g. the signal is prepared in an orbital space (satellite) station, travels through the atmosphere and arrives at a
receiver in a ground-earth station. The latter performs a binary-outcome quantum measurement, leading to a measurement
outcome n ∈ {0, 1}. With this information, the receiver provides a guess k̂ on the value of the bit transmitted, and the
quality of such a protocol is given by the success probability. Such a quantity represents the score function S_E(θ) introduced
in Sec. II, and depends on the intensity |α|² of the transmitted states, the quantum channel over which the
communication takes place, and the specific quantum measurement that is performed by the receiver.
Among all possible quantum measurements that the receiver can implement, we will here focus on the Kennedy receiver
[79], which consists of displacing the incoming signal by a value θ and measuring the resulting state with an on/off photo-detector,
as schematized in Fig. 3. While the optimal quantum measurement is given by the Dolinar receiver [70,76],
which involves complex conditional measurements that ultimately lead to difficulties in experimental implementations [73,
74], the Kennedy receiver can readily beat the standard quantum limit [80] and essentially constitutes the main building
block of the former.
In this example, the device configuration is defined by (i) the parameter θ of the displacement operation, and (ii) a
guessing rule which associates the measurement outcome n with a guessed value of the initially transmitted bit k. We note
that access to the score value (success probability) is granted only in cases where the transmission channel and device
functioning have been perfectly characterized. Such is often not the case, as atmospheric conditions tend to vary strongly and
unpredictably, a fact that ultimately affects the transmission performance [11-13,15-17,77].
We now revisit the re-calibration framework introduced in Sec. II for the Kennedy receiver. As stated above, the score
function S_E(θ) is given by the success probability of the communication protocol, which depends on the displacement
FIG. 3: Diagram of a Kennedy receiver; this consists of applying a displacement θ to the incoming signal and measuring it with an on/off
photo-detector.
value θ and the guessing rule k̂(θ, n). Note that if access to the outcome probabilities were granted, the agent would
perform a maximum-likelihood guess. However, in situations where such probabilities are not available, e.g. when no model
of the transmission channel exists, the agent also needs to learn the optimal guessing rule. Thus, we remark that the score
function depends on the specific transmission channel acting between sender and receiver, and potentially differs
from the noiseless success probability, i.e. the one obtained with an identity channel acting in between parties. The latter quantity constitutes in
our approach the effective score S_Ẽ(θ). Here, the intensity |α|² is initially estimated using N_eff experiments in which the
displacement value θ is set to zero, so that the outcome probabilities can readily be linked to α through the Born
rule, p(n = 0 | (−1)^k α) = e^{−|(−1)^k α|²}, with Σ_{i=0,1} p(n = i | (−1)^k α) = 1. The outcome statistics are used to estimate the signal
intensity, which in turn serves as a way to initialize the state-action value functions {Q(θ), Q(k̂; n, θ)} to the success
probability of setting displacement θ and the conditional probability of guessing k̂ given observation n and displacement θ,
respectively.
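As a rough sketch of this intensity-estimation step, the following Python snippet simulates an ideal on/off detector obeying the Born rule above; the sampling scheme, names and numbers are illustrative assumptions of ours.

import numpy as np

def estimate_intensity(n_eff, alpha_true, seed=1):
    # With the displacement set to theta = 0, the no-click probability is
    # p(n=0) = exp(-|alpha|^2) for either transmitted phase, so |alpha|^2 can
    # be read off the empirical no-click frequency over N_eff experiments.
    rng = np.random.default_rng(seed)
    p_no_click = np.exp(-abs(alpha_true) ** 2)   # Born rule at theta = 0
    clicks = rng.random(n_eff) > p_no_click      # n = 1 means a click
    p_hat = 1.0 - clicks.mean()                  # estimate of p(n = 0)
    return -np.log(p_hat)                        # estimate of |alpha|^2

print(estimate_intensity(n_eff=1000, alpha_true=0.7))  # close to 0.49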
The aforementioned quantities are consequently used by a Q-learning agent, which fine-tunes the calibration strategy
to the experimental conditions at hand; this is done by providing a binary reward to the agent according to the correctness
of its guess k̂, and it can be proven that such a scheme converges to the optimal device configuration [10]. The Q-learning
method is applied for N_rl experiments, and then the receiver is deployed. While in the deployment stage, the agent monitors
the measurement-outcome statistics by keeping track of a running average. This quantity serves as a decalibration witness
W_d, and abrupt changes of this quantity indicate that a change-point has occurred. When the system is out of the expected
region (specified by the agent), the calibration protocol is re-started.
Malfunctioning device example. We now consider the case in which the Kennedy receiver is initially calibrated to its
optimal configuration, with a pre-defined intensity value |α_0|², deployed under ideal conditions for such an initial environment
E_0, and then undergoes a decalibration. The new environment E_1 consists of a different intensity value |α_1|² of the signals
arriving at the receiver, plus a faulty displacement. Here, the value θ the agent fixes actually displaces the signal by a
value λθ, with λ > 1 an unknown parameter. The effect of this faulty behavior is to make the displacements bigger
than expected, shifting the value of the optimal configuration θ. As a consequence, the score-function landscape gets
modified. We remark that the malfunctioning behavior is unknown to the agent, who first loads the Q-values for the ideal
case using its effective model S_Ẽ(θ), and then fine-tunes them by Q-learning; further details on the implementation are
provided in Appendix A.
In Fig. 4 (top) we show the learning curve of the agent in terms of the cumulative reward acquired. The decalibration witness
W_d is taken to be an estimate of the outcome probability p̂(n = 1), and by monitoring abrupt changes in this quantity
the agent is able to detect the environment shift E_0 → E_1. The change point at which the device enters a malfunctioning
stage occurs at experiment 5 × 10^5, and can readily be seen in the top panel of Fig. 4 through the change in the behavior of W_d. Additionally,
this can be detected by an abrupt change in the cumulative reward acquired; however, we note that such a quantity
is potentially not available during the deployment stage. As a consequence, the agent uses its change-point detection strategy
to re-activate the calibration protocol, by estimating the new signal intensity and initializing the Q-values with the
effective model obtained thereby. Note that in this new scenario the effective model no longer coincides with the
underlying truth. This fact is illustrated by the initial and final Q-values obtained by the agent, shown in Fig. 4 (bottom),
where we additionally show the optimal configuration θ̂ suggested by the agent at a given experiment.
IV. Outlook & future research directions
In this work, we presented a re-calibration framework, accompanied by an automatic re-calibration method, targeted
at quantum-technology applications.
We illustrated the proposed method by studying a Kennedy receiver under heavily varying deployment conditions.
As in any device, decalibration is a frequent problem that needs to be addressed. This example serves as a test-bed for
our automatic re-calibration framework, showing that the calibrating agent is able not only to configure the device in a
semi-agnostic way, but also to detect situations in which the device gets off-calibrated. Our mechanism allows for the
automation of the re-calibration process and can readily be applied to a wider scope, even beyond quantum-technology
applications.
Specifically, our technique reduces the number of experiment repetitions needed to (re)-calibrate the device. This is
FIG. 4: Recalibration and learning curve. (Top): Learning-curve evolution: running average of the reward acquired using 10³ experiments,
and evolution of the decalibration witness W_d estimated from the measurement statistics. As can be seen, at the change-point
the witness presents a large fluctuation, triggering a recalibration of the system until the agent converges to the optimal reward. (Bottom):
Update of the Q-value curve (left) and evolution of the agent's greedy strategy, i.e. the configuration the calibrating agent would choose
at each experiment (right).
done by making use of an effective model, whose purpose is to capture the main features of the configuration landscape, and
is complemented by model-free reinforcement-learning techniques. Additionally, we introduce the decalibration-witness
statistic, which plays a key role in detecting either novelties or anomalies related to the device's functioning. Such a
quantity is conceived as a figure of merit to be calculated during device deployment. In this stage, the score function quantifying
the quality of the controls chosen by the agent is not computable, and the agent can only rely on information
available in the experiment, e.g. statistics of the measurement outcomes.
A plethora of change-point/anomaly detection methods can be used to complement our method [30-32,81-83].
However, let us remark that an alternative to monitoring the decalibration witness can also be considered, i.e.
presetting a calibration control routine. Such a scheme demands a balance between deploying the device and guaranteeing
that the optimal configuration is being kept, and can potentially be implemented by allowing intermediate calibration
stages in between deployment. We remark that while model-free RL techniques could potentially adapt the controls to
smooth changes in the optimal configuration (without the necessity of an effective model nor a decalibration witness),
abrupt changes would in practice corrupt a successful adaptation. Here, a notion of abruptness arises when it comes
to environment changes: on the one hand we identify continual reinforcement learning [84,85] (where the calibration
agent smoothly adapts the configuration as the environment smoothly varies), and on the other hand domain adaptation
in reinforcement learning [86,87] (where the calibration needs to be adapted under changes of abrupt nature, as the ones
considered in this paper). The setting studied here might also be tackled from an active-learning framework [88], in
which the agent may inject prior knowledge about the different conditions in which the device is expected to be deployed,
and which can potentially be used to further exploit the symbiosis between the model-free and model-aware routines considered
above.
A straightforward extension of settings where our re-calibration framework finds real-world implementations is given
by Noisy Intermediate-Scale Quantum (NISQ) devices, where the strong presence of noise severely limits the scope of
applications, and developing tools to address such issues is an active area of research.
Furthermore, our work opens the door for several follow-up implementations and enhancements of the re-calibration
protocol. Among them, usage of more sophisticated RL methods [43,89] and inspecting the possibility of a coherent
re-calibration by usage of quantum correlations [90,91].
V. Acknowledgments
M.B. is grateful for useful discussions with John Calsamiglia about enhancing the Q-value initialization. T.C. wants
to thank F. Tomás B. Perez for ideas on possible applications and numerical results. T.C. acknowledges the support from
the CVC fellowship program. M.B. acknowledges support from AGAUR Grant no. 2023 INV-2 00034 funded by the European
Union, Next Generation EU, and Grant PID2021-126808OB-I00 funded by MCIN/AEI/10.13039/501100011033
and by ERDF A way of making Europe. F.V. and M.B. acknowledge the support from the Spanish Ministry of Science
and Innovation through the project GRAIL, grant no. PID291-1268080B-100. J.M.M. and L.R. acknowledge Consejo
Nacional de Investigaciones Científicas y Técnicas (CONICET) and support from ANPCyT Argentina, Préstamo BID,
Grant no. PICT 20203490.
REFERENCES
[1] V. Cimini, E. Polino, M. Valeri, I. Gianani, N. Spagnolo, G. Corrielli, A. Crespi, R. Osellame, M. Barbieri y F. Sciarrino.
Calibration of Multiparameter Sensors via Machine Learning at the Single-Photon Level. Physical Review Applied 15 (abr. de
2021).http://dx.doi.org/10.1103/physrevapplied.15.044003.
[2] V. Cimini, M. Valeri, S. Piacentini, F. Ceccarelli, G. Corrielli, R. Osellame, N. Spagnolo y F. Sciarrino. Variational quantum
algorithm for experimental photonic multiparameter estimation. npj Quantum Information 10 (feb. de 2024).http://dx.doi.org/
10.1038/s41534-024-00821-0.
[3] H. Ren, J. Yang, X. Liu, P. Huang y L. Guo. Sensor Modeling and Calibration Method Based on Extinction Ratio Error for
Camera-Based Polarization Navigation Sensor. Sensors 20 (jul. de 2020).http://dx.doi.org/10.3390/s20133779.
[4] F. Vernuccio, A. Bresci, V. Cimini, A. Giuseppi, G. Cerullo, D. Polli y C. M. Valensise. Artificial Intelligence in Classical and
Quantum Photonics. Laser & Photonics Reviews 16 (mar. de 2022).http://dx.doi.org/10.1002/lpor.202100399.
[5] K. Ono. Calibration Methods of Acoustic Emission Sensors. Materials 9(jun. de 2016).http://dx.doi.org/10.3390/ma9070508.
[6] S. Zhao, J. Liu e Y. Li. Online Calibration Method for Current Sensors Based on GPS. Energies 12 (mayo de 2019).http:
//dx.doi.org/10.3390/en12101923.
[7] H. Haitjema. The Calibration of Displacement Sensors. Sensors 20 (ene. de 2020).http://dx.doi.org/10.3390/s20030584.
[8] V. Cimini, I. Gianani, N. Spagnolo, F. Leccese, F. Sciarrino y M. Barbieri. Calibration of Quantum Sensors by Neural Networks.
Physical Review Letters 123 (dic. de 2019).http://dx.doi.org/10.1103/physrevlett.123.230502.
[9] Y. Zhang, L. O. H. Wijeratne, S. Talebi y D. J. Lary. Machine Learning for Light Sensor Calibration. Sensors 21 (sep. de 2021).
http://dx.doi.org/10.3390/s21186259.
[10] M. Bilkis, M. Rosati, R. M. Yepes y J. Calsamiglia. Real-time calibration of coherent-state receivers: Learning by trial and error.
Phys. Rev. Res. 2, 033295 (ago. de 2020).https://link.aps.org/doi/10.1103/PhysRevResearch.2.033295.
[11] M. Bilkis, M. Rosati y J. Calsamiglia. Reinforcement-learning calibration of coherent-state receivers on variable-loss optical
channels en 2021 IEEE Information Theory Workshop (ITW) (2021), 1-6.
[12] D. Dequal, L. T. Vidarte, V. R. Rodriguez, A. Leverrier, G. Vallone, P. Villoresi y E. Diamanti. Feasibility of satellite quantum
key distribution with continuous variable en Quantum Inf. Meas. V Quantum Technol. Part F165- (OSA, Washington, D.C.,
feb. de 2019), T5A.89. ISBN: 978-1-943580-56-9. arXiv: 2002. 02002.http://arxiv.org/ abs/2002.02002%20https:/ /www.
osapublishing.org/abstract.cfm?URI=QIM-2019-T5A.89.
[13] L. C. Andrews y R. L. Phillips. Laser Beam Propagation through Random Media ISBN: 9780819459480. http : / / ebooks .
spiedigitallibrary.org/book.aspx?doi=10.1117/3.626196 (SPIE, 1000 20th Street, Bellingham, WA 98227-0010 USA, sep. de
2005).
[14] S. Pirandola. Satellite quantum communications: Fundamental bounds and practical security. Phys. Rev. Res. 3, 023130 (ma-
yo de 2021).ISSN: 2643-1564. https://link.aps.org/doi/10.1103/PhysRevResearch.3.023130.
[15] S. Pirandola. Limits and security of free-space quantum communications. Phys. Rev. Res. 3, 013279 (mar. de 2021).ISSN:
2643-1564. https://link.aps.org/doi/10.1103/PhysRevResearch.3.013279.
[16] D. Y. Vasylyev, A. A. Semenov y W. Vogel. Toward Global Quantum Communication: Beam Wandering Preserves Nonclassi-
cality. Phys. Rev. Lett. 108, 220501 (jun. de 2012).https://link.aps.org/doi/10.1103/PhysRevLett.108.220501.
[17] D. Vasylyev, A. A. Semenov, W. Vogel, K. Günthner, A. Thurn, Ö. Bayraktar y C. Marquardt. Free-space quantum links under
diverse weather conditions. Phys. Rev. A 96, 043856 (oct. de 2017).https://link.aps.org/doi/10.1103/PhysRevA.96.043856.
[18] D. Jung, Y. Choi y J. Kim. Multiobjective Automatic Parameter Calibration of a Hydrological Model. Water 9(mar. de 2017).
http://dx.doi.org/10.3390/w9030187.
[19] D. Kavetski, G. Kuczera y S. W. Franks. Calibration of conceptual hydrological models revisited: 1. Overcoming numerical
artefacts. Journal of Hydrology 320. The model parameter estimation experiment, 173-186 (2006).ISSN: 0022-1694. https :
//www.sciencedirect.com/science/article/pii/S0022169405003379.
[20] K. Papafotis, D. Nikitas y P. P. Sotiriadis. Magnetic Field Sensors’ Calibration: Algorithms’ Overview and Comparison. Sensors
21 (ago. de 2021).http://dx.doi.org/10.3390/s21165288.
[21] G. Cao, X. Xu y D. Xu. Real-Time Calibration of Magnetometers Using the RLS/ML Algorithm. Sensors 20 (ene. de 2020).
http://dx.doi.org/10.3390/s20020535.
[22] A. Fallani, M. A. C. Rossi, D. Tamascelli y M. G. Genoni. Learning Feedback Control Strategies for Quantum Metrology. PRX
Quantum 3, 020310 (abr. de 2022).https://link.aps.org/doi/10.1103/PRXQuantum.3.020310.
[23] L. J. Fiderer, J. Schuff y D. Braun. Neural-Network Heuristics for Adaptive Bayesian Quantum Estimation. PRX Quantum 2,
020303 (abr. de 2021).https://link.aps.org/doi/10.1103/PRXQuantum.2.020303.
[24] L. J. Fiderer y D. Braun. Quantum metrology with quantum-chaotic sensors. Nature Communications 9, 1351 (abr. de 2018).
ISSN: 2041-1723. https://doi.org/10.1038/s41467-018-03623-z.
[25] C. Lee, B. Lawrie, R. Pooser, K.-G. Lee, C. Rockstuhl y M. Tame. Quantum Plasmonic Sensors. Chemical Reviews 121 (mar. de
2021).http://dx.doi.org/10.1021/acs.chemrev.0c01028.
[26] S. Nolan, A. Smerzi y L. Pezzè. A machine learning approach to Bayesian parameter estimation. npj Quantum Information 7
(dic. de 2021).http://dx.doi.org/10.1038/s41534-021-00497-w.
[27] Y. Ban, J. Echanobe, Y. Ding, R. Puebla y J. Casanova. Neural-network-based parameter estimation for quantum detection.
Quantum Science and Technology 6(ago. de 2021).http://dx.doi.org/10.1088/2058-9565/ac16ed.
[28] Y. Chen, Y. Ban, R. He, J.-M. Cui, Y.-F. Huang, C.-F. Li, G.-C. Guo y J. Casanova. A neural network assisted 171Yb+ quantum
magnetometer. npj Quantum Information 8(dic. de 2022).http://dx.doi.org/10.1038/s41534-022-00669-2.
[29] K. Rambhatla, S. E. D’Aurelio, M. Valeri, E. Polino, N. Spagnolo y F. Sciarrino. Adaptive phase estimation through a genetic
algorithm. Physical Review Research 2(jul. de 2020).http://dx.doi.org/10.1103/physrevresearch.2.033078.
[30] G. Sentís, E. Bagan, J. Calsamiglia, G. Chiribella y R. Muñoz-Tapia. Quantum Change Point. Phys. Rev. Lett. 117, 150502
(oct. de 2016).https://link.aps.org/doi/10.1103/PhysRevLett.117.150502.
[31] M. Fanizza, C. Hirche y J. Calsamiglia. Ultimate Limits for Quickest Quantum Change-Point Detection. Phys. Rev. Lett. 131,
020602 (jul. de 2023).https://link.aps.org/doi/10.1103/PhysRevLett.131.020602.
[32] G. Sentís, J. Calsamiglia y R. Muñoz-Tapia. Exact Identification of a Quantum Change Point. Phys. Rev. Lett. 119, 140506
(oct. de 2017).https://link.aps.org/doi/10.1103/PhysRevLett.119.140506.
[33] K. A. Woźniak, V. Belis, E. Puljak, P. Barkoutsos, G. Dissertori, M. Grossi, M. Pierini, F. Reiter, I. Tavernelli y S. Vallecorsa.
Quantum anomaly detection in the latent space of proton collision events at the LHC 2023. eprint: arXiv:2301.10780.
[34] J. S. Baker, H. Horowitz, S. K. Radha, S. Fernandes, C. Jones, N. Noorani, V. Skavysh, P. Lamontangne y B. C. Sanders.
Quantum Variational Rewinding for Time Series Anomaly Detection 2022. eprint: arXiv:2210.16438.
[35] M. Guo, S. Pan, W. Li, F. Gao, S. Qin, X. Yu, X. Zhang y Q. Wen. Quantum algorithm for unsupervised anomaly detection.
Physica A: Statistical Mechanics and its Applications 625, 129018 (2023).ISSN: 0378-4371. https://www.sciencedirect.com/
science/article/pii/S0378437123005733.
[36] S. Llorens, G. Sentís y R. Muñoz-Tapia. Quantum multi-anomaly detection 2023. eprint: arXiv:2312.13020.
[37] M. Skotiniotis, R. Hotz, J. Calsamiglia y R. Muñoz-Tapia. Identification of malfunctioning quantum devices 2018. eprint: arXiv:
1808.02729.
[38] N. Liu y P. Rebentrost. Quantum machine learning for quantum anomaly detection. Phys. Rev. A 97, 042315 (abr. de 2018).
https://link.aps.org/doi/10.1103/PhysRevA.97.042315.
[39] M. A. Nielsen e I. L. Chuang. Quantum Computation and Quantum Information (Cambridge University Press, 2000).
[40] M. P. A. Branderhorst, J. Nunn, I. A. Walmsley y R. L. Kosut. Simplified quantum process tomography. New Journal of Physics
11, 115010 (nov. de 2009).https://dx.doi.org/10.1088/1367-2630/11/11/115010.
[41] A. Shabani, R. L. Kosut, M. Mohseni, H. Rabitz, M. A. Broome, M. P. Almeida, A. Fedrizzi y A. G. White. Efficient Measure-
ment of Quantum Dynamics via Compressive Sensing. Phys. Rev. Lett. 106, 100401 (mar. de 2011).https://link.aps.org/doi/10.
1103/PhysRevLett.106.100401.
[42] M. T. DiMario y F. E. Becerra. Channel-noise tracking for sub-shot-noise-limited receivers with neural networks. Physical
Review Research 3(mar. de 2021).http://dx.doi.org/10.1103/physrevresearch.3.013200.
[43] R. Sutton y A. G. Barto. Reinforcement Learning Sutton ISBN: 9780262039246 (MIT Press, 2018).
[44] A. Dawid, J. Arnold, B. Requena, A. Gresch, M. Płodzień, K. Donatella, K. A. Nicoli, P. Stornati, R. Koch, M. Büttner, R.
Okuła, G. Muñoz-Gil, R. A. Vargas-Hernández, A. Cervera-Lierta, J. Carrasquilla, V. Dunjko, M. Gabrié, P. Huembeli, E. van
Nieuwenburg, F. Vicentini, L. Wang, S. J. Wetzel, G. Carleo, E. Greplová, R. Krems, F. Marquardt, M. Tomza, M. Lewenstein
y A. Dauphin. Modern applications of machine learning in quantum sciences 2022. eprint: arXiv:2204.04198.
[45] S. Borah, B. Sarma, M. Kewming, G. J. Milburn y J. Twamley. Measurement-Based Feedback Quantum Control with Deep
Reinforcement Learning for a Double-Well Nonlinear Potential. Phys. Rev. Lett. 127, 190403 (nov. de 2021).https://link.aps.
org/doi/10.1103/PhysRevLett.127.190403.
[46] H. J. Briegel y G. De las Cuevas. Projective simulation for artificial intelligence. Scientific Reports 2, 400 (mayo de 2012).ISSN:
2045-2322. https://doi.org/10.1038/srep00400.
[47] J. Wallnöfer, A. A. Melnikov, W. Dür y H. J. Briegel. Machine Learning for Long-Distance Quantum Communication. PRX
Quantum 1, 010301 (sep. de 2020).https://link.aps.org/doi/10.1103/PRXQuantum.1.010301.
[48] C. Cui, W. Horrocks, S. Hao, S. Guha, N. Peyghambarian, Q. Zhuang y Z. Zhang. Quantum receiver enhanced by adaptive
learning. Light: Science & Applications 11, 344 (dic. de 2022).ISSN: 2047-7538. https://doi.org/10.1038/s41377-022-01039-5.
[49] N. Rengaswamy, K. P. Seshadreesan, S. Guha y H. D. Pfister. Belief propagation with quantum messages for quantum-enhanced
classical communications. npj Quantum Information 7, 97 (jun. de 2021).ISSN: 2056-6387. https://doi.org/10.1038/s41534-
021-00422-1.
[50] C. Piveteau y J. M. Renes. Quantum message-passing algorithm for optimal and efficient decoding. Quantum 6, 784 (ago. de
2022).ISSN: 2521-327X. https://doi.org/10.22331/q-2022-08-23-784.
[51] C. L. Cortes, P. Lefebvre, N. Lauk, M. J. Davis, N. Sinclair, S. K. Gray y D. Oblak. Sample-efficient adaptive calibration of
quantum networks using Bayesian optimization. Phys. Rev. Applied (mar. de 2022). journals.aps.org/prapplied/abstract/10.
1103/PhysRevApplied.17.034067.
[52] V. V. Sivak, A. Eickbusch, H. Liu, B. Royer, I. Tsioutsios y M. H. Devoret. Model-Free Quantum Control with Reinforcement
Learning. Phys. Rev. X 12, 011059 (mar. de 2022).https://link.aps.org/doi/10.1103/PhysRevX.12.011059.
[53] M. Y. Niu, S. Boixo, V. N. Smelyanskiy y H. Neven. Universal quantum control through deep reinforcement learning. npj
Quantum Information 5, 33 (abr. de 2019).ISSN: 2056-6387. https://doi.org/10.1038/s41534-019-0141-3.
[54] T. Fösel, P. Tighineanu, T. Weiss y F. Marquardt. Reinforcement Learning with Neural Networks for Quantum Feedback. Phys.
Rev. X 8, 031084 (sep. de 2018).https://link.aps.org/doi/10.1103/PhysRevX.8.031084.
[55] P. Altmann, J. Stein, M. Kölle, A. Bärligea, T. Gabor, T. Phan, S. Feld y C. Linnhoff-Popien. Challenges for Reinforcement
Learning in Quantum Circuit Design 2023. eprint: arXiv:2312.11337.
[56] M. Nägele y F. Marquardt. Optimizing ZX-Diagrams with Deep Reinforcement Learning 2023. eprint: arXiv:2311.18588.
[57] A. Skolik, M. Cattelan, S. Yarkoni, T. Bäck y V. Dunjko. Equivariant quantum circuits for learning on weighted graphs 2022.
eprint: arXiv:2205.06109.
[58] A. Khandelwal y S. DiAdamo. Enhancing Protocol Privacy with Blind Calibration of Quantum Devices 2022. eprint: arXiv:
2209.05634.
[59] B. Zhou, C. Lu, B.-M. Mao, H.-y. Tam y S. He. Magnetic field sensor of enhanced sensitivity and temperature self-calibration
based on silica fiber Fabry-Perot resonator with silicone cavity. Opt. Express 25, 8108-8114 (abr. de 2017).https://opg.optica.
org/oe/abstract.cfm?URI=oe-25-7-8108.
[60] M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio
y P. J. Coles. Variational quantum algorithms. Nature Reviews Physics 3, 625-644 (sep. de 2021).ISSN: 2522-5820. https :
//doi.org/10.1038/s42254-021-00348-9.
[61] K. Bharti, A. Cervera-Lierta, T. H. Kyaw, T. Haug, S. Alperin-Lea, A. Anand, M. Degroote, H. Heimonen, J. S. Kottmann, T.
Menke, W.-K. Mok, S. Sim, L.-C. Kwek y A. Aspuru-Guzik. Noisy intermediate-scale quantum algorithms. Rev. Mod. Phys.
94, 015004 (feb. de 2022).https://link.aps.org/doi/10.1103/RevModPhys.94.015004.
[62] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik y J. L. O’Brien. A variational
eigenvalue solver on a photonic quantum processor. Nature Communications 5, 4213 (jul. de 2014).ISSN: 2041-1723. https:
//doi.org/10.1038/ncomms5213.
[63] E. Fontana, M. Cerezo, A. Arrasmith, I. Rungger y P. J. Coles. Non-trivial symmetries in quantum landscapes and their resi-
lience to quantum noise. Quantum 6(sep. de 2022).https://quantum-journal.org/papers/q-2022-09-15-804/#.
[64] K. Banaszek, L. Kunz, M. Jachura y M. Jarzyna. Quantum Limits in Optical Communications. J. Light. Technol. 38, 2741-2754
(mayo de 2020).ISSN: 0733-8724. arXiv: 2002.05766.https://ieeexplore.ieee.org/document/8998224/.
[65] M. Rosati y V. Giovannetti. Achieving the Holevo bound via a bisection decoding protocol. J. Math. Phys. 57, 062204 (jun. de
2015).ISSN: 00222488. arXiv: 1506.04999.http://aip.scitation.org/doi/10.1063/1.4953690%20http://arxiv.org/abs/1506.
04999%20http://dx.doi.org/10.1063/1.4953690.
[66] S. Pirandola, U. L. Andersen, L. Banchi, M. Berta, D. Bunandar, R. Colbeck, D. Englund, T. Gehring, C. Lupo, C. Ottaviani,
J. L. Pereira, M. Razavi, J. Shamsul Shaari, M. Tomamichel, V. C. Usenko, G. Vallone, P. Villoresi y P. Wallden. Advances in
quantum cryptography. Adv. Opt. Photonics 12, 1012 (dic. de 2020).ISSN: 1943-8206. arXiv: 1906.01645.http://arxiv.org/abs/
1906.01645%20http://dx.doi.org/10.1364/AOP.361502%20https://www.osapublishing.org/abstract.cfm?URI=aop-12-4-1012.
[67] M. M. Wilde. Quantum Information Theory (Cambridge University Press, 2013).
[68] R. Nasser y J. M. Renes. Polar codes for arbitrary classical-quantum channels and arbitrary cq-MACs en 2017 IEEE Interna-
tional Symposium on Information Theory (ISIT) (2017), 281-285.
[69] C. W. Helstrom. Quantum Detection and Estimation Theory 309. ISBN: 9780123400505 (Academic press, New York, 1976).
[70] A. Assalini, N. Dalla Pozza y G. Pierobon. Revisiting the Dolinar receiver through multiple-copy state discrimination theory.
Phys. Rev. A 84, 022342 (ago. de 2011).https://link.aps.org/doi/10.1103/PhysRevA.84.022342.
[71] F. Zoratti, N. Dalla Pozza, M. Fanizza y V. Giovannetti. Agnostic Dolinar receiver for coherent-state classification. Phys. Rev. A
104, 042606 (oct. de 2021).https://link.aps.org/doi/10.1103/PhysRevA.104.042606.
[72] M. Takeoka, M. Sasaki, P. van Loock y N. Lütkenhaus. Implementation of projective measurements with linear optics and
continuous photon counting. Phys. Rev. A 71, 022318 (feb. de 2005).https://link.aps.org/doi/10.1103/PhysRevA.71.022318.
[73] R. L. Cook, P. J. Martin, J. M. Geremia, B. A. Chase y J. M. Geremia. Optical coherent state discrimination using a closed-loop
quantum measurement. Nature 446, 774-777 (abr. de 2007).ISSN: 14764687. http ://www. nature . com / doifinder / 10 .1038/
nature05655%20http://www.nature.com/articles/nature05655.
[74] D. Sych y G. Leuchs. Practical Receiver for Optimal Discrimination of Binary Coherent Signals. Phys. Rev. Lett. 117, 200501
(nov. de 2016).https://link.aps.org/doi/10.1103/PhysRevLett.117.200501.
[75] F. E. Becerra, J. Fan, G. Baumgartner, S. V. Polyakov, J. Goldhar, J. T. Kosloski y A. Migdall. M-ary-state phase-shift-keying
discrimination below the homodyne limit. Phys. Rev. A 84, 062324 (dic. de 2011).https://link.aps.org/doi/10.1103/PhysRevA.
84.062324.
[76] S. J. Dolinar. Communication and sciences engineering. Q. Prog. Rep. (Research Lab. Electron. 111, 115 (1973). https://dspace.
mit.edu/handle/1721.1/56414.
[77] V. C. Usenko, B. Heim, C. Peuntinger, C. Wittmann, C. Marquardt, G. Leuchs y R. Filip. Entanglement of Gaussian states and
the applicability to quantum key distribution over fading channels. New J. Phys. 14, 093048 (sep. de 2012).ISSN: 1367-2630.
https://iopscience.iop.org/article/10.1088/1367-2630/14/9/093048.
[78] M. T. DiMario y F. E. Becerra. Demonstration of optimal non-projective measurement of binary coherent states with photon
counting. npj Quantum Information 8, 84 (jul. de 2022).ISSN: 2056-6387. https://doi.org/10.1038/s41534-022-00595-3.
[79] R. S. Kennedy. Near-Optimum Receiver for the Binary Coherent State Quantum Channel. MIT Res. Lab. Electron. Q. Prog.
Rep. 108, 219 (1973). https://dspace.mit.edu/handle/1721.1/56346.
[80] A. Ferraro, S. Olivares y M. G. A. Paris. Gaussian states in continuous variable quantum information ISBN: ISBN 88-7088-483-
X (Bibliopolis, Napoli, 2005).
[81] E. S. Page. A test for a change in a parameter occurring at an unknown point. Biometrika 42, 523-527 (dic. de 1955). ISSN:
0006-3444. eprint: https://academic.oup.com/biomet/article-pdf/42/3-4/523/838813/42-3-4-523.pdf.https://doi.org/10.1093/
biomet/42.3-4.523.
[82] E. Brodsky y B. Darkhovsky. Non-Parametric Statistical Diagnosis: Problems and Methods ISBN: 9789048154654. https :
//books.google.es/books?id=Ar56cgAACAAJ (Springer Netherlands, 2010).
[83] M. Basseville e I. Nikiforov. Detection of Abrupt Change Theory and Application ISBN: 0-13-126780-9 (Prentice Hall, abr. de
1993).
[84] D. Abel, A. Barreto, B. V. Roy, D. Precup, H. van Hasselt y S. Singh. A Definition of Continual Reinforcement Learning 2023.
eprint: arXiv:2307.11046.
[85] K. Khetarpal, M. Riemer, I. Rish y D. Precup. Towards Continual Reinforcement Learning: A Review and Perspectives 2020.
eprint: arXiv:2012.13490.
[86] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba y P. Abbeel. Domain randomization for transferring deep neural networks
from simulation to the real world en 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2017),
23-30.
[87] J. Xing, T. Nagata, K. Chen, X. Zou, E. Neftci y J. L. Krichmar. Domain Adaptation In Reinforcement Learning Via Latent
Unified State Representation 2021. eprint: arXiv:2102.05714.
[88] P. Radeva, M. Drozdzal, S. Segui, L. Igual, C. Malagelada, F. Azpiroz y J. Vitria. Active labeling: Application to wireless
endoscopy analysis en 2012 International Conference on High Performance Computing & Simulation (HPCS) (2012), 174-181.
[89] Y. Baum, M. Amico, S. Howell, M. Hush, M. Liuzzi, P. Mundada, T. Merkh, A. R. Carvalho y M. J. Biercuk. Experimental
Deep Reinforcement Learning for Error-Robust Gate-Set Design on a Superconducting Quantum Computer. PRX Quantum 2,
040324 (nov. de 2021).https://link.aps.org/doi/10.1103/PRXQuantum.2.040324.
[90] Y. Liao, M.-H. Hsieh y C. Ferrie. Quantum Optimization for Training Quantum Neural Networks 2021. eprint: arXiv:2103.
17047.
[91] A. A. et al. Quantum Optimization: Potential, Challenges, and the Path Forward 2023. eprint: arXiv:2312.02279.
[92] R. Bellman y S. Dreyfus. Dynamic Programming ISBN: 9780691146683. http://www.jstor.org/stable/j.ctv1nxcw0f (2024)
(Princeton University Press, 2010).
[93] C. Szepesvári. Algorithms for Reinforcement Learning 2010. http://dx.doi.org/10.2200/S00268ED1V01Y201005AIM009.
[94] C. Watkins. “Learning from delayed rewards”. PhD thesis Tesis doct. (Cambridge, 1989). http://www.cs.rhul.ac.uk/%7B~%
7Dchrisw/thesis.html.
[95] T. Lattimore y C. Szepesvári. Bandit Algorithms (Cambridge University Press, 2020).
[96] T. Crosta, M. Matera y M. Bilkis. Repository https://github.com/dmtomas/qrec (GitHub, 2024).
A. Additional details on the RL implementation
In the following, we briefly present the Reinforcement Learning (RL) framework and provide further details on the
illustrative example considered in Sec. III. Moreover, we briefly analyze an alternative noisy scenario where a change of
priors occurs, and benchmark our technique against a standard Q-learning method.
Reinforcement Learning is based on the sequential interaction between an agent and the environment during several
episodes [43]. Each episode E consists of steps t = 1, ..., T (where T is potentially of stochastic nature). At step t, the
agent observes a state s_t and follows a policy π(a_t | s_t) in order to choose an action a_t. As a consequence, the agent
receives a reward r_{t+1} and transitions to the next state s_{t+1}. The goal of the agent is to maximize the reward acquired during
the episodes, which is accomplished by performing the optimal policy. To do this, the agent has to exploit valuable actions
but also explore possibly advantageous configurations, leading to an exploration-exploitation trade-off. The framework
allows for intermediate rewards appearing during the episode, and hence the return is defined as G_t = Σ_{k=0}^{T−t} γ^k r_{t+k+1}, with
γ ≤ 1 a weighting factor. The return is a quantity that depends on the sequence of states and actions visited during
each episode, and its average value, the so-called state-action value function Q^π(s, a) = E_π[G_t | s_t = s, a_t = a], indicates
how valuable action a is when departing from state s and following policy π thereafter. In this setting, the optimal policy
π* is obtained by finding the maximum Q-value for each given state, a problem that is reflected by the so-called optimal
Bellman equation [43,92].
In this regard, the Q-learning algorithm is a model-free method that exploits the structure of the Bellman equations by
linking them to contractive operations and shifting the policy towards the fixed point associated to the optimal Bellman
operator [93,94]. In order to find the optimal Q-values Q*(s, a) = Q^{π*}(s, a), the algorithm updates the Q-estimate as

Q̂(s_t, a_t) ← (1 − λ_E) Q̂(s_t, a_t) + λ_E [ r_{t+1} + γ max_a Q̂(s_{t+1}, a) ]    (1)
with λ_E an episode-dependent learning rate; in Alg. 2 we sketch the Q-learning pseudo-code. Here, the agent explores
the state-action space by committing to an ε-greedy policy π_ε, defined as selecting a random action with probability ε,
and the one maximizing the current state-action value estimate Q̂(s, a) otherwise. Note that (i) such a greedy action might
potentially be a suboptimal option, and (ii) a schedule for ε is set in practice, in order to balance between exploration and
exploitation [10,43,95].
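As a minimal tabular sketch of these two ingredients (our own illustration, not the implementation of Ref. [96]):

import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def q_update(Q, s, a, r, s_next, lam, gamma=1.0):
    # One Q-learning step, Eq. (1): soft update towards r + gamma * max_a' Q.
    bootstrap = gamma * Q[s_next].max() if s_next is not None else 0.0
    Q[s, a] = (1.0 - lam) * Q[s, a] + lam * (r + bootstrap)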
We now turn to provide additional details on the numerical implementation for the Kennedy receiver considered in Sec.
III. Our code is open-sourced and can be found in Ref. [96]. All hyperparameter values used in this implementation are
given in Table 1.
Parameter              Meaning                                                                  Proposed method   Q-learning
check-jump threshold   Number of repeated selections of the maximum that is considered         3000              3000
                       convergence.
δ                      Minimum change in W_d required in order to recalibrate.                 0.1               0.1
ε_0                    Minimum exploration of the agent.                                       0.05              0.1
∆ε                     Rate of change for ε.                                                   0.9               0.9999
∆                      How much we deviate from a uniform distribution.                        50                0
l                      Step from which the learning-rate decay starts.                         150               1

TABLE 1: Hyperparameters used in the numerical examples, with a description of their interpretation.
Decalibration witness. In order to detect changes in the environment during an off-calibration stage, the running average
of the detector output is computed across experiments E = 1, ..., N_eff, where we set N_eff = 1000, e.g. W_d(E) = (1/N_eff) Σ_{i=1}^{N_eff} n_{t_i}.
At each experiment, the difference between the current average and the previous one, |W_d^E − W_d^{E−1}|, is computed. Here,
Algorithm 2: Q-learning pseudocode.
1  for episode E = 1, ... do
2      initialize s_0
3      for step t in episode E do
4          choose a_t ∼ π_ε
5          get r_{t+1}, s_{t+1}
6          update Q̂(s_t, a_t) using Eq. (1)
if this difference is larger than the (hyper)parameter δ, and assuming the device is being deployed, the re-calibration
process is restarted.
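A sketch of this check, using the block averages and the threshold δ of Table 1 (the array handling is ours), could be:

import numpy as np

def witness_is_anomalous(outcomes, n_eff=1000, delta=0.1):
    # W_d is the average detector output over a block of N_eff experiments;
    # a jump larger than delta between consecutive blocks flags a change point.
    if len(outcomes) < 2 * n_eff:
        return False                               # not enough data yet
    w_prev = np.mean(outcomes[-2 * n_eff:-n_eff])  # witness at block E - 1
    w_curr = np.mean(outcomes[-n_eff:])            # witness at block E
    return abs(w_curr - w_prev) > delta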
Effective model. When the (re-)calibration is initiated, we consider an effective model given by the success probability
of a noiseless Kennedy receiver, computed by first estimating the signal's intensity |α|². Such success probability can
be linked to the optimal Q-values for an ideal environment E_0 in which the device functions correctly [10]. To this end,
the displacement value in the Kennedy receiver is set to zero during N_eff = 1000 experiments, and the intensity |α|² is
estimated as per |α|² = −ln p̂(n = 0 | α), with a statistical uncertainty that scales as 1/√N_eff. Consequently, the Q-learning agent fine-tunes the device configuration,
which is potentially deployed under an environment whose score function differs from the noiseless effective model here
considered. Importantly, the Q-values are initialized to Q_0^α(θ) and Q_1^α(θ, k̂) as per Q_0^α(θ) = max_{k̂=0,1} Q_1^α(θ, k̂) (with the maximization over the guessing rule for each outcome n ∈ {0,1})
and Q_1^α(θ, k̂) = ½ e^{−|(−1)^k̂ α + θ|²} + ½ (1 − e^{−|(−1)^{k̂+1} α + θ|²}).
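Under our reading of the formulas above (Q_1 as the success probability of guessing k̂ upon "no click" and 1 − k̂ upon "click", Q_0 as its optimum over guessing rules), the initialization could be sketched as follows; variable names are ours.

import numpy as np

def init_q_values(alpha, thetas):
    # Noiseless Kennedy effective model. Q1[j, k] is the success probability of
    # choosing displacement thetas[j] and guessing k upon "no click" (and 1 - k
    # upon "click"); Q0[j] is its optimum over the two guessing rules.
    thetas = np.asarray(thetas, dtype=float)
    Q1 = np.empty((len(thetas), 2))
    for k in (0, 1):
        p_no_click_ok = np.exp(-np.abs((-1) ** k * alpha + thetas) ** 2)
        p_click_ok = 1.0 - np.exp(-np.abs((-1) ** (k + 1) * alpha + thetas) ** 2)
        Q1[:, k] = 0.5 * p_no_click_ok + 0.5 * p_click_ok
    Q0 = Q1.max(axis=1)
    return Q0, Q1

# Example: a grid of candidate displacements for an estimated |alpha|^2 of ~0.49.
Q0, Q1 = init_q_values(alpha=0.7, thetas=np.linspace(-1.5, 1.5, 31))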
FIG. 5: We show the average internal strategy of 25 calibrating agents to fine-tune the device configuration under a change-of-prior
scenario. Specifically, we depict the Q-values (left panel) from the effective model, and the ones obtained after RL fine-tuning. In the
right panel, we show the evolution of the agent's greedy strategy.
Q-learning hyperparameters. We scheduled the exploration rate of π_ε as per ε_E = max(ε_0, ε_{E−1} ∆ε), with ε_0, ∆ε ∈ [0, 1)
and ε_{E=0} = 1, e.g. it is reduced over different episodes. Here, ε_0 provides the minimum exploration level, while ∆ε
gives the rate of change of the exploration. On a different note, we modify the uniform sampling in π_ε by p(a | Q̂) =
(1/N) exp{−∆ [Q̂(â) − Q̂(a)]²}, where 1/N is a normalization factor, Q̂(a) is the current Q-value estimate, â is the associated
greedy action, and ∆ an importance-sampling parameter (∆ = 0 returns a uniformly random distribution) [43]. Finally,
the Q-learning learning rate λ_E in Eq. (1) is set to decay as 1/E, where we recall that E is the episode number. Note
that, because of the initial information obtained by setting the effective model, we allow the learning rate to take smaller
values as per 1/(t + l), where l can be understood as how much the effective model is trusted by the RL agent.
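A sketch of these schedules, under our reconstruction of the partly garbled expressions above (both the symbols and the exact functional forms should be taken as indicative):

import numpy as np

def epsilon_schedule(episode, eps0=0.05, d_eps=0.9):
    # Exploration schedule: multiplicative decay from eps = 1 towards the
    # floor eps0 (values taken from the "Proposed method" column of Table 1).
    return max(eps0, d_eps ** episode)

def learning_rate(step, l=150):
    # Learning-rate decay 1/(t + l): a larger l means the effective-model
    # initialization is trusted longer before large Q-value updates occur.
    return 1.0 / (step + l)

def sampling_weights(q_row, dev=50.0):
    # Exploration distribution biased towards actions whose Q-value is close
    # to the greedy one (dev plays the role of the deviation parameter of
    # Table 1); dev = 0 recovers a uniform distribution.
    w = np.exp(-dev * (q_row.max() - q_row) ** 2)
    return w / w.sum()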
Change of priors example. To test the resilience of the method to a different noise source, we here analyze the situation
in which a change of priors occurs. We recall that the prior constitutes the probability p_k of sending the classical bit k, and in
this example it gets modified as per p_k → p_k(λ_2) = 1/2 + (−1)^k λ_2, where λ_2 stands for the noise parameter, unknown to the
agent. The results of our automatic re-calibration method for this scenario are shown in Fig. 5. Here, we average the results
over 25 instances obtained through random initializations, taking 10³ experiments to find the optimal configuration in
90% of the runs and 7 × 10³ to finish.
Comparison with Q-learning: Finally, we compare the technique introduced in this work with the standard Q-learning method [10].
FIG. 6: Re-calibration learning curve. We show the mean internal strategy of 25 reinforcement-learning agents optimizing the device
configuration, (top) using a faulty displacement and (bottom) a change of the prior probability. Specifically, we depict the Q-values
(left panel) obtained by trial and error and the evolution of the greedy strategy (right panel).
The results are benchmarked in Fig. 6, under the noisy scenarios previously considered. As shown in the figure, traditional
Q-learning presents higher fluctuations in the Q-value estimates, requiring roughly 10× the number of experiments compared to the
method introduced in this paper, which exploits the usage of an effective model.