The sparse information captured by the sensory systems is used by the brain to apprehend the environment, for example, to spatially locate the source of audiovisual stimuli. This is an ill-posed inverse problem whose inherent uncertainty can be solved by jointly processing the information, as well as introducing constraints during this process, on the way this multisensory information is handled. This process and its result - the percept - depend on the contextual conditions perception takes place in. To date, perception has been investigated and modeled on the basis of either one of two of its dimensions: the percept or the temporal dynamics of the process. Here, we extend our previously proposed audiovisual perception model to predict both these dimensions to capture the phenomenon as a whole. Starting from a behavioral analysis, we use a data-driven approach to elicit a Bayesian network which infers the different percepts and dynamics of the process. Context-specific independence analyses enable us to use the model's structure to directly explore how different contexts affect the way subjects handle the same available information. Hence, we establish that, while the percepts yielded by a unisensory stimulus or by the non-fusion of multisensory stimuli may be similar, they result from different processes, as shown by their differing temporal dynamics. Moreover, our model predicts the impact of bottom-up (stimulus driven) factors as well as of top-down factors (induced by instruction manipulation) on both the perception process and the percept itself.