[Interest] Cross platform accelerated instructions framework
Nuno Santos
nunosantos at imaginando.pt
Thu May 14 17:13:36 CEST 2015
Thanks for all the insights so far.
I confess I'm a newbie regarding performance optimisation. I'm writing a synthesiser. It computes audio in real time and is already getting heavy at 8 voices of polyphony. It pushes an iPad 2 to its limit, which is not good, especially when I interact with the user interface and the audio starts glitching.
I think there is a lot of room for optimisation, especially in how the code is structured, but I'm not sure.
Maybe I should first optimise the code for maximum performance using the compiler only. I have a lot of encapsulation and I'm not sure whether this is good for optimisation. For example, the following function calculates the output of one of the synthesiser voices. Sorry for the long code listing, but maybe someone could point out basic mistakes I'm making that completely compromise compiler optimisations.
Of course, for vectorisation I will need to identify opportunities and refactor the data structures to make it possible. But who knows, maybe I'm doing terrible things that, once fixed, could save me a nice bunch of important CPU cycles?
(this is by far the longest function in the whole program)
// typedef float IAudioSample
IAudioSample IBasicSynthVoice::step()
{
    IAudioSample output=0;
    IAudioSample filterModulation=0;
    IAudioSample pitchModulationSum=0;
    IAudioSample tmp1=0,tmp2=0;
    float eg1 = _eg[0].step();
    float eg2 = _eg[1].step();
    eg1 += eg1*_modWheelMultiplier[MODWHEEL_EG_1];
    eg2 += eg2*_modWheelMultiplier[MODWHEEL_EG_2];
    // applying pitch modulation
    switch (_pitchModulationSource[0])
    {
        case 1:
            tmp1 += _lfo1;
            tmp1 += _lfo1;
            break;
        case 2:
            tmp1 += eg1;
            tmp1 += eg1;
            tmp1 += eg1;
            tmp1 += eg1;
            break;
        case 3:
            tmp1 += _lfo1;
            tmp1 += _lfo1;
            tmp1 += eg1;
            tmp1 += eg1;
            tmp1 += eg1;
            tmp1 += eg1;
            break;
        default:
            break;
    }
    tmp1 *= _pitchModulationAmount[0];
    switch (_pitchModulationSource[1])
    {
        case 1:
            tmp2 += _lfo2;
            tmp2 += _lfo2;
            break;
        case 2:
            tmp2 += eg2;
            tmp2 += eg2;
            tmp2 += eg2;
            tmp2 += eg2;
            break;
        case 3:
            tmp2 += _lfo2;
            tmp2 += _lfo2;
            tmp2 += eg2;
            tmp2 += eg2;
            tmp2 += eg2;
            tmp2 += eg2;
            break;
        default:
            break;
    }
    tmp2 *= _pitchModulationAmount[1];
    pitchModulationSum = (tmp1+tmp2)/12.f;
    pitchModulationSum *= _noteFrequency;
    float _osc1PitchModulation = 0;
    float _osc2PitchModulation = 0;
    float _subPitchModulation = 0;
    switch (_pitchModulationDestination)
    {
        case 1:
            _osc1PitchModulation += pitchModulationSum;
            break;
        case 2:
            _osc2PitchModulation += pitchModulationSum;
            break;
        case 3:
            _osc1PitchModulation += pitchModulationSum;
            _osc2PitchModulation += pitchModulationSum;
            break;
        case 4:
            _subPitchModulation += pitchModulationSum;
            break;
        case 5:
            _osc1PitchModulation += pitchModulationSum;
            _subPitchModulation += pitchModulationSum;
            break;
        case 6:
            _osc2PitchModulation += pitchModulationSum;
            _subPitchModulation += pitchModulationSum;
            break;
        case 7:
            _osc1PitchModulation += pitchModulationSum;
            _osc2PitchModulation += pitchModulationSum;
            _subPitchModulation += pitchModulationSum;
            break;
    }
    if (_pitchBendDestination[PITCHBEND_OSC_1])
    {
        _osc1PitchModulation += _pitchBendMultiplier*_noteFrequency;
        _subPitchModulation += _pitchBendMultiplier*_noteFrequency;
    }
    if (_pitchBendDestination[PITCHBEND_OSC_2])
    {
        _osc2PitchModulation += _osc2.frequency()*_pitchBendMultiplier;
    }
    _osc1.setModulation(_osc1PitchModulation);
    _osc2.setModulation(_osc2PitchModulation);
    _sub.setModulation(_subPitchModulation);
    float sub = _sub.step();
    float osc1 = _osc1.step();
    float osc2 = _osc2.step();
    if (_osc2Sync && _osc1.sync())
        _osc2.setPhase(0);
    float ring = osc1*osc2;
    // FM
    //_osc1.setModulation(osc2*_crossModulationAmount*2500);
    // mixer
    output = (ring*_ringAmount);
    output += (osc1*_osc1Volume);
    output += (osc2*_osc2Volume);
    output += (sub*_subVolume);
    output += (_noise);
    _saturator.process(&output, &output);
    //calculateFilterModulation(eg2, osc2);
    // applying filter modulation
    // modulation amount - eg2
    filterModulation += eg2*_filterModulationAmount[0]*(1+_filterModulationAmount[5]*_velocity);
    // filter modulation amount - 1 - lfo1
    filterModulation += _lfo1*_filterModulationAmount[1]*_filterModulationAmount[1];
    // filter modulation amount - 2 - lfo2
    filterModulation += _lfo2*_filterModulationAmount[2]*_filterModulationAmount[2];
    // filter modulation amount - 3 - vco2
    filterModulation += osc2*_filterModulationAmount[3];
    // filter modulation amount - 4 - kbd
    if (_pitchBendDestination[PITCHBEND_FILTER])
        filterModulation += (powf(2, _pitchBendRange*_pitchBend)-1);
    _filter.setKeyboardMultiplier(_kbdFilter);
    _filter.setModulation(filterModulation);
    _filter.process(&output, &output);
    // filter modulation amount - 5 - vel
    // vca modulation - eg1, eg2, kbd
    output *= (eg1*_ampModulationAmount[2]+eg2*_ampModulationAmount[3])*_kbdFilter;
    output *= (1+_ampModulationAmount[5]*_velocity);
    float ampModulationSum = 0;
    // vca modulation - lfo1
    ampModulationSum += _lfo1*_ampModulationAmount[0];
    // vca modulation - lfo2
    ampModulationSum += _lfo2*_ampModulationAmount[1];
    if (ampModulationSum>1.5)
        ampModulationSum=1.5;
    if (ampModulationSum<-1.5)
        ampModulationSum=-1.5;
    output -= output*ampModulationSum;
    return output;
}
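One thing that stands out in the listing: the case values in the switches on _pitchModulationSource and _pitchModulationDestination look like bit masks (1, 2, 3 for the sources; 1 to 7 for the destinations). Below is a rough sketch of how those branches might be written as conditional selects instead. It assumes the values really are masks and that the source stays within 0..3 (in the original, any other value falls to the default case and adds nothing). The names pitchModSource, routePitchMod, PitchModTargets and clampAmpMod are hypothetical helpers for illustration, not part of the existing class, and multiplying by 2 or 4 may round slightly differently from the repeated additions.

```cpp
#include <algorithm>

// Hypothetical helper (not in the original class).
// Assumes source is in 0..3: bit 0 selects the LFO (added twice in the
// original switch), bit 1 selects the envelope (added four times).
inline float pitchModSource(int source, float lfo, float eg)
{
    const float lfoWeight = (source & 1) ? 2.0f : 0.0f;
    const float egWeight  = (source & 2) ? 4.0f : 0.0f;
    return lfo * lfoWeight + eg * egWeight;
}

// Hypothetical result type: one modulation value per oscillator.
struct PitchModTargets
{
    float osc1;
    float osc2;
    float sub;
};

// Assumes destination is a 3-bit mask, as the cases 1..7 suggest:
// bit 0 = osc1, bit 1 = osc2, bit 2 = sub.
inline PitchModTargets routePitchMod(int destination, float sum)
{
    PitchModTargets t;
    t.osc1 = (destination & 1) ? sum : 0.0f;
    t.osc2 = (destination & 2) ? sum : 0.0f;
    t.sub  = (destination & 4) ? sum : 0.0f;
    return t;
}

// Same limits as the two if statements on ampModulationSum (+/- 1.5).
inline float clampAmpMod(float value)
{
    return std::min(1.5f, std::max(-1.5f, value));
}
```

Whether this is actually faster would need to be measured; the compiler may already turn the switches into something similar.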
Nuno Santos
> On 14 May 2015, at 13:52, Allan Sandfeld Jensen <kde at carewolf.com> wrote:
>
> To write in a way that the compiler can auto-vectorize, write the CPU
> intensive work in simple inner loops without function calls (or only inlined
> ones), use no array access by anything other than the index counter, and also
> avoid branches as much as possible. If you do need branches, write them as
> conditional assignments using c ? a : b.
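
To make the quoted advice concrete, here is a minimal sketch of the kind of loop it describes: a plain inner loop over a buffer, indexed only by the loop counter, with no function calls and with the clamp written as conditional assignments. The function applyGainClamped and its parameters are hypothetical and do not exist in the synthesiser code; whether a given compiler actually vectorises it depends on the compiler and the optimisation flags used.

```cpp
#include <cstddef>

// Hypothetical block-based helper: scales n samples by gain and clamps
// each result to [-limit, limit].
void applyGainClamped(const float* in, float* out, std::size_t n,
                      float gain, float limit)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        const float v = in[i] * gain;
        // Conditional assignments instead of if statements.
        const float upper = (v > limit) ? limit : v;
        out[i] = (upper < -limit) ? -limit : upper;
    }
}
```

The per-sample step() in the post handles one sample per call, so using a loop like this would mean restructuring the code to process a block of samples at a time.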